[BUG][PATCH 1/4] ext3: fix a cause of __schedule_bug via blkdev

All of lore.kernel.org
 help / color / mirror / Atom feed

* [BUG][PATCH 1/4] ext3: fix a cause of __schedule_bug via blkdev_releasepage
@ 2008-12-02 11:06 Toshiyuki Okajima
  2008-12-08 14:01 ` Theodore Tso
  0 siblings, 1 reply; 25+ messages in thread
From: Toshiyuki Okajima @ 2008-12-02 11:06 UTC (permalink / raw)
  To: aneesh.kumar, balbir, tytso; +Cc: linux-ext4

ext3: fix a cause of __schedule_bug via blkdev_releasepage
From: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>

A cause of this problem is calling log_wait_commit() on 
journal_try_to_free_buffers() with a read-lock via blkdev_releasepage(). This 
logic is for uncommitted data buffers. And a read/write-lock is required for a 
client usage of blkdev_releasepage.

By the way, we want to release only metadata buffers on ext3_release_metadata().
Because a page which binds to blkdev is used as metadata for ext3.

Therefore we don't need to wait for a commit on journal_try_to_free_buffers() 
via ext3_release_matadata().
As a result, we add a journal_try_to_free_metadata_buffers() almost same 
as journal_try_to_free_buffers() except not calling log_wait_commit.
So, we change journal_try_to_free_buffers() with 
journal_try_to_free_metadata_buffers().

This issue was reported by Aneesh Kumar K.V.
http://marc.info/?l=linux-ext4&m=122814568309893&w=2

Reported-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com> 
Cc: "Theodore Ts'o" <tytso@mit.edu>
--
 fs/ext3/inode.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff -Nurp linux-2.6.28-rc6/fs/ext3/inode.c linux-2.6.28-rc6.2/fs/ext3/inode.c
--- linux-2.6.28-rc6/fs/ext3/inode.c	2008-12-02 09:32:27.000000000 +0900
+++ linux-2.6.28-rc6.2/fs/ext3/inode.c	2008-12-02 10:19:53.000000000 +0900
@@ -1696,7 +1696,7 @@ int ext3_release_metadata(void *client, 
 	BUG_ON(EXT3_SB(sb) == NULL);
 	journal = EXT3_SB(sb)->s_journal;
 	if (journal != NULL)
-		return journal_try_to_free_buffers(journal, page, wait);
+		return journal_try_to_free_metadata_buffers(journal, page, wait);
 	else
 		return try_to_free_buffers(page);
 }

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [BUG][PATCH 1/4] ext3: fix a cause of __schedule_bug via blkdev_releasepage
  2008-12-02 11:06 [BUG][PATCH 1/4] ext3: fix a cause of __schedule_bug via blkdev_releasepage Toshiyuki Okajima
@ 2008-12-08 14:01 ` Theodore Tso
  2008-12-08 14:06   ` [PATCH -V2] ext3: provide function to release metadata pages under memory pressure Theodore Ts'o
  2008-12-12  0:54   ` [BUG][PATCH 1/4] ext3: fix a cause of __schedule_bug via blkdev_releasepage Toshiyuki Okajima
  0 siblings, 2 replies; 25+ messages in thread
From: Theodore Tso @ 2008-12-08 14:01 UTC (permalink / raw)
  To: Toshiyuki Okajima; +Cc: aneesh.kumar, balbir, linux-ext4

Toshiyuki-san,

My apologies for not having a chance to review your patches; things
have been rather busy for me.  There were a couple of shortcomings in
your patch series, and I think there is a better way of solving the
issue at hand.  First of all, patches #1 and #2 use a new function
which is not actually defined until patches #3 and #4, respectively.
This can make it difficult for people who are trying to use "git
bisect" to try to localize the root cause of a problem.

Secondly, the introduction of a large number of wrapper functions
increases stack utilization, and makes the call depth deeper (and in
the long run, each increase in call depth makes the code that much
harder to trace through and understand), and so it should be done only
as last resort.  Fortunately, there is a simpler way of fixing this
problem.  I include the inter-diff below, but I will fold this into
the current blkdev_releasepage() patches that are in the ext4 patch
queue.

Best regards,

					- Ted

P.S.  Note that this patch is functionally identical to what you
proposed in your patch series, but since the gfp_wait mask already
controlls whether or not log_wait_commit() is called, instead of
introducing a new functional parameter, we just mask off __GFP_WAIT
before calling jbd2_journal_try_to_free_buffers.  I'll make a similar
change to the ext3 patch, and attach the two revised patches to this
mail thread.

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e77a059..543f3d0 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3049,6 +3049,12 @@ static int ext4_releasepage(struct page *page, gfp_t wait)
  * mapped via the block device.  Since these pages could have journal heads
  * which would prevent try_to_free_buffers() from freeing them, we must use
  * jbd2 layer's try_to_free_buffers() function to release them.
+ *
+ * Note: we have to strip the __GFP_WAIT flag before calling
+ * jbd2_journal_try_to_free_buffers because blkdev_releasepage is
+ * called while holding a spinlock (bdev_inode.client_lock).
+ * Fortunately the metadata buffers we are interested are freed right
+ * away and do not require calling journal_wait_for_transaction_sync_data().
  */
 int ext4_release_metadata(void *client, struct page *page, gfp_t wait)
 {
@@ -3061,7 +3067,8 @@ int ext4_release_metadata(void *client, struct page *page, gfp_t wait)
 	BUG_ON(EXT4_SB(sb) == NULL);
 	journal = EXT4_SB(sb)->s_journal;
 	if (journal != NULL)
-		return jbd2_journal_try_to_free_buffers(journal, page, wait);
+		return jbd2_journal_try_to_free_buffers(journal, page,
+							wait & ~__GFP_WAIT);
 	else
 		return try_to_free_buffers(page);
 }

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH -V2] ext3: provide function to release metadata pages under memory pressure
  2008-12-08 14:01 ` Theodore Tso
@ 2008-12-08 14:06   ` Theodore Ts'o
  2008-12-08 14:06     ` [PATCH -V2] ext4: " Theodore Ts'o
  2008-12-12  0:54   ` [BUG][PATCH 1/4] ext3: fix a cause of __schedule_bug via blkdev_releasepage Toshiyuki Okajima
  1 sibling, 1 reply; 25+ messages in thread
From: Theodore Ts'o @ 2008-12-08 14:06 UTC (permalink / raw)
  To: Ext4 Developers List
  Cc: aneesh.kumar, balbir, Toshiyuki Okajima, Theodore Ts'o,
	linux-fsdevel

From: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>

Pages in the page cache belonging to ext3 data files are released via
the ext3_releasepage() function specified in the ext3 inode's
address_space_ops.  However, metadata blocks (such as indirect blocks,
directory blocks, etc) are managed via the block device
address_space_ops, and they can not be released by
try_to_free_buffers() if they have a journal head attached to them.

To address this, we supply a release_metadata function which is called
by the block device's blkdev_releasepage() function, which calls
journal_try_to_free_buffers() function to free the metadata.

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-fsdevel@vger.kernel.org
---
 fs/ext3/inode.c         |   29 +++++++++++++++++++++++++++++
 fs/ext3/super.c         |    2 ++
 include/linux/ext3_fs.h |    2 ++
 3 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index f8424ad..89c065a 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1680,6 +1680,35 @@ static int ext3_releasepage(struct page *page, gfp_t wait)
 }
 
 /*
+ * Try to release metadata pages (indirect blocks, directories) which are
+ * mapped via the block device.  Since these pages could have journal heads
+ * which would prevent try_to_free_buffers() from freeing them, we must use
+ * jbd layer's try_to_free_buffers() function to release them.
+ *
+ * Note: we have to strip the __GFP_WAIT flag before calling
+ * journal_try_to_free_buffers because blkdev_releasepage is
+ * called while holding a spinlock (bdev_inode.client_lock).
+ * Fortunately the metadata buffers we are interested are freed right
+ * away and do not require calling journal_wait_for_transaction_sync_data().
+ */
+int ext3_release_metadata(void *client, struct page *page, gfp_t wait)
+{
+	struct super_block *sb = (struct super_block*)client;
+	journal_t *journal;
+
+	WARN_ON(PageChecked(page));
+	if (!page_has_buffers(page))
+		return 0;
+	BUG_ON(EXT3_SB(sb) == NULL);
+	journal = EXT3_SB(sb)->s_journal;
+	if (journal != NULL)
+		return journal_try_to_free_buffers(journal, page,
+						   wait & ~__GFP_WAIT);
+	else
+		return try_to_free_buffers(page);
+}
+
+/*
  * If the O_DIRECT write will extend the file then add this inode to the
  * orphan list.  So recovery will truncate it back to the original size
  * if the machine crashes during the write.
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 541d5e4..4428149 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -2981,6 +2981,8 @@ static struct file_system_type ext3_fs_type = {
 	.name		= "ext3",
 	.get_sb		= ext3_get_sb,
 	.kill_sb	= kill_block_super,
+	.release_metadata 
+			= ext3_release_metadata,
 	.fs_flags	= FS_REQUIRES_DEV,
 };
 
diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
index 9004794..3e3052f 100644
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -867,6 +867,8 @@ extern void ext3_get_inode_flags(struct ext3_inode_info *);
 extern void ext3_set_aops(struct inode *inode);
 extern int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 		       u64 start, u64 len);
+extern int ext3_release_metadata(void *client, struct page *page, 
+			gfp_t wait);
 
 /* ioctl.c */
 extern int ext3_ioctl (struct inode *, struct file *, unsigned int,
-- 
1.6.0.4.8.g36f27.dirty


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH -V2] ext4: provide function to release metadata pages under memory pressure
  2008-12-08 14:06   ` [PATCH -V2] ext3: provide function to release metadata pages under memory pressure Theodore Ts'o
@ 2008-12-08 14:06     ` Theodore Ts'o
  0 siblings, 0 replies; 25+ messages in thread
From: Theodore Ts'o @ 2008-12-08 14:06 UTC (permalink / raw)
  To: Ext4 Developers List
  Cc: aneesh.kumar, balbir, Toshiyuki Okajima, Theodore Ts'o,
	linux-fsdevel

From: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>

Pages in the page cache belonging to ext4 data files are released via
the ext4_releasepage() function specified in the ext4 inode's
address_space_ops.  However, metadata blocks (such as indirect blocks,
directory blocks, etc) are managed via the block device
address_space_ops, and they can not be released by
try_to_free_buffers() if they have a journal head attached to them.

To address this, we supply a release_metadata function which is called
by the block device's blkdev_releasepage() function, which calls
journal_try_to_free_buffers() function to free the metadata.

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-fsdevel@vger.kernel.org
---
 fs/ext4/ext4.h  |    2 ++
 fs/ext4/inode.c |   29 +++++++++++++++++++++++++++++
 fs/ext4/super.c |    4 ++++
 3 files changed, 35 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 695b45c..91e06e4 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1099,6 +1099,8 @@ extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks);
 extern int ext4_block_truncate_page(handle_t *handle,
 		struct address_space *mapping, loff_t from);
 extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page);
+extern int ext4_release_metadata(void *client, struct page *page,
+		gfp_t wait);
 
 /* ioctl.c */
 extern long ext4_ioctl(struct file *, unsigned int, unsigned long);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f6d9447..1647903 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3045,6 +3045,35 @@ static int ext4_releasepage(struct page *page, gfp_t wait)
 }
 
 /*
+ * Try to release metadata pages (indirect blocks, directories) which are
+ * mapped via the block device.  Since these pages could have journal heads
+ * which would prevent try_to_free_buffers() from freeing them, we must use
+ * jbd2 layer's try_to_free_buffers() function to release them.
+ *
+ * Note: we have to strip the __GFP_WAIT flag before calling
+ * jbd2_journal_try_to_free_buffers because blkdev_releasepage is
+ * called while holding a spinlock (bdev_inode.client_lock).
+ * Fortunately the metadata buffers we are interested are freed right
+ * away and do not require calling journal_wait_for_transaction_sync_data().
+ */
+int ext4_release_metadata(void *client, struct page *page, gfp_t wait)
+{
+	struct super_block *sb = (struct super_block*)client;
+	journal_t *journal;
+
+	WARN_ON(PageChecked(page));
+	if (!page_has_buffers(page))
+		return 0;
+	BUG_ON(EXT4_SB(sb) == NULL);
+	journal = EXT4_SB(sb)->s_journal;
+	if (journal != NULL)
+		return jbd2_journal_try_to_free_buffers(journal, page,
+							wait & ~__GFP_WAIT);
+	else
+		return try_to_free_buffers(page);
+}
+
+/*
  * If the O_DIRECT write will extend the file then add this inode to the
  * orphan list.  So recovery will truncate it back to the original size
  * if the machine crashes during the write.
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index bd41fad..f447c46 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3689,6 +3689,8 @@ static struct file_system_type ext4_fs_type = {
 	.name		= "ext4",
 	.get_sb		= ext4_get_sb,
 	.kill_sb	= kill_block_super,
+	.release_metadata
+			= ext4_release_metadata,
 	.fs_flags	= FS_REQUIRES_DEV,
 };
 
@@ -3708,6 +3710,8 @@ static struct file_system_type ext4dev_fs_type = {
 	.name		= "ext4dev",
 	.get_sb		= ext4dev_get_sb,
 	.kill_sb	= kill_block_super,
+	.release_metadata
+			= ext4_release_metadata,
 	.fs_flags	= FS_REQUIRES_DEV,
 };
 MODULE_ALIAS("ext4dev");
-- 
1.6.0.4.8.g36f27.dirty


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [BUG][PATCH 1/4] ext3: fix a cause of __schedule_bug via blkdev_releasepage
  2008-12-08 14:01 ` Theodore Tso
  2008-12-08 14:06   ` [PATCH -V2] ext3: provide function to release metadata pages under memory pressure Theodore Ts'o
@ 2008-12-12  0:54   ` Toshiyuki Okajima
  2008-12-12  6:21     ` Theodore Tso
  1 sibling, 1 reply; 25+ messages in thread
From: Toshiyuki Okajima @ 2008-12-12  0:54 UTC (permalink / raw)
  To: Theodore Tso; +Cc: aneesh.kumar, balbir, linux-ext4

Ted-san,
Thank you for your reviewing.

 > Toshiyuki-san,
 >
 > My apologies for not having a chance to review your patches; things
 > have been rather busy for me.  There were a couple of shortcomings in
 > your patch series, and I think there is a better way of solving the
 > issue at hand.  First of all, patches #1 and #2 use a new function
 > which is not actually defined until patches #3 and #4, respectively.
 > This can make it difficult for people who are trying to use "git
 > bisect" to try to localize the root cause of a problem.
 >
 > Secondly, the introduction of a large number of wrapper functions
 > increases stack utilization, and makes the call depth deeper (and in
 > the long run, each increase in call depth makes the code that much
 > harder to trace through and understand), and so it should be done only
 > as last resort.  Fortunately, there is a simpler way of fixing this
 > problem.  I include the inter-diff below, but I will fold this into
 > the current blkdev_releasepage() patches that are in the ext4 patch
 > queue.
 >
 > Best regards,
 >
 > 					- Ted
 >
 > P.S.  Note that this patch is functionally identical to what you
 > proposed in your patch series, but since the gfp_wait mask already
 > controlls whether or not log_wait_commit() is called, instead of
 > introducing a new functional parameter, we just mask off __GFP_WAIT
 > before calling jbd2_journal_try_to_free_buffers.  I'll make a similar
 > change to the ext3 patch, and attach the two revised patches to this
 > mail thread.

Through the idea as follows,  I agree to your change.

-----------------------------------------------------------------------------
To tell the truth, at first, I imagined the same patch as yours to fix this
problem. But I have made another patch because I thought that ext3(or ext4)
should not know the contents of the processing of journal_try_to_free_buffers
  in detail. (ext3 should not know there is a possibility to call
journal_wait_for_transaction_sync_data from journal_try_to_free_buffers.)
So, I have made a new function, journal_try_to_free_metadata_buffers
to release only metadata buffer_heads.
(I wanted a function which is the almost same as journal_try_to_free_buffers
except calling journal_wait_for_transaction_sync_data from it.)
However, this new function needed big changes, you know.

I reconsidered what was the most suitable patch to fix this problem
after I read your mail (patch).
And then, I thought that it was important to make the most concise patch
to fix only a root cause. Big patch is not easy to understand even if it is
more logical one.

Therefore, there is the fact that ext3_release_metadata must not sleep because
it can get a spinlock, and then, only changing ext3_release_metadata to the
logic to make it not sleep is the simplest fix for this problem.
-----------------------------------------------------------------------------

Best Regards,
Toshiyuki Okajima


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [BUG][PATCH 1/4] ext3: fix a cause of __schedule_bug via blkdev_releasepage
  2008-12-12  0:54   ` [BUG][PATCH 1/4] ext3: fix a cause of __schedule_bug via blkdev_releasepage Toshiyuki Okajima
@ 2008-12-12  6:21     ` Theodore Tso
  2008-12-12 17:52       ` [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems Theodore Ts'o
  2008-12-15  2:21       ` [BUG][PATCH 1/4] ext3: fix a cause of __schedule_bug via blkdev_releasepage Toshiyuki Okajima
  0 siblings, 2 replies; 25+ messages in thread
From: Theodore Tso @ 2008-12-12  6:21 UTC (permalink / raw)
  To: Toshiyuki Okajima; +Cc: aneesh.kumar, balbir, linux-ext4

On Fri, Dec 12, 2008 at 09:54:18AM +0900, Toshiyuki Okajima wrote:
> To tell the truth, at first, I imagined the same patch as yours to fix this
> problem. But I have made another patch because I thought that ext3(or ext4)
> should not know the contents of the processing of journal_try_to_free_buffers
>  in detail. (ext3 should not know there is a possibility to call
> journal_wait_for_transaction_sync_data from journal_try_to_free_buffers.)

I agree, but ext3 doesn't need to know that.  What my change did was
to mask off the _GFP_WAIT flag, which prohibits the function it calls
from blocking, because it knows that its caller is holding a spinlock.

And actually, come to think of it.  We can do even better; the right
fix is to have blkdev_releasepage() mask off the _GFP_WAIT flag; this
is the function which is taking spinlock, and by masking off the
__GFP_WAIT flag, this is simply requesting all of the downstream
functions not to block, but to do the best job they can do without
blocking.  It doesn't need to know whether it's going to call
log_wait_commit(), or anything else; all it needs to do is request
"please don't block".

That means we only make the request once, in the function which is
taking spinlock, so all of the per-filesystem implementations of
release_metadata() don't need to know that its caller is holding a
spinlock. 

							- Ted

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems
  2008-12-12  6:21     ` Theodore Tso
@ 2008-12-12 17:52       ` Theodore Ts'o
  2008-12-12 17:52         ` [PATCH -v3] ext3: provide function to release metadata pages under memory pressure Theodore Ts'o
                           ` (2 more replies)
  2008-12-15  2:21       ` [BUG][PATCH 1/4] ext3: fix a cause of __schedule_bug via blkdev_releasepage Toshiyuki Okajima
  1 sibling, 3 replies; 25+ messages in thread
From: Theodore Ts'o @ 2008-12-12 17:52 UTC (permalink / raw)
  To: Ext4 Developers List; +Cc: Toshiyuki Okajima, Theodore Ts'o, linux-fsdevel

From: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>

Implement blkdev_releasepage() to release the buffer_heads and page
after we release private data which belongs to a client of the block
device, such as a filesystem.

blkdev_releasepage() call the client's releasepage() which is
registered by blkdev_register_client_releasepage() to release its
private data.

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-fsdevel@vger.kernel.org
---
 fs/block_dev.c     |   68 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/super.c         |   22 ++++++++++++++++
 include/linux/fs.h |    9 +++++++
 3 files changed, 99 insertions(+), 0 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index db831ef..bac0a38 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -29,6 +29,9 @@
 
 struct bdev_inode {
 	struct block_device bdev;
+	void *client;
+	int (*client_releasepage)(void*, struct page*, gfp_t);
+	rwlock_t client_lock;
 	struct inode vfs_inode;
 };
 
@@ -260,6 +263,9 @@ static struct inode *bdev_alloc_inode(struct super_block *sb)
 	struct bdev_inode *ei = kmem_cache_alloc(bdev_cachep, GFP_KERNEL);
 	if (!ei)
 		return NULL;
+	ei->client = NULL;
+	ei->client_releasepage = NULL;
+	rwlock_init(&ei->client_lock);
 	return &ei->vfs_inode;
 }
 
@@ -1208,6 +1214,67 @@ static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 	return blkdev_ioctl(bdev, mode, cmd, arg);
 }
 
+/*
+ * blkdev_releasepage: execute ei->client_releasepage() if it exists.
+ * Otherwise, execute try_to_free_buffers().
+ * ei->client_releasepage() releases private client's page if possible.
+ * Because a buffer_head's using counter is bigger than 0 if a client has
+ * a page for private usage. If so, try_to_free_buffers() cannot release it.
+ * Therefore a client must try to release a page itself.
+ */
+static int blkdev_releasepage(struct page *page, gfp_t wait)
+{
+	struct bdev_inode *ei = BDEV_I(page->mapping->host);
+	int ret;
+
+	read_lock(&ei->client_lock);
+	if (ei->client_releasepage != NULL)
+		/*
+		 * Since we are holding a spinlock (ei->client_lock),
+		 * make sure the client_releasepage function
+		 * understands that it must not block.
+		 */
+		ret = (*ei->client_releasepage)(ei->client, page,
+						wait & ~__GFP_WAIT);
+	else
+		ret = try_to_free_buffers(page);
+	read_unlock(&ei->client_lock);
+	return ret;
+}
+
+/*
+ * blkdev_register_client_releasepage: register client_releasepage.
+ */
+int blkdev_register_client_releasepage(struct block_device *bdev,
+	void *client, int (*releasepage)(void*, struct page*, gfp_t))
+{
+	struct bdev_inode *ei = BDEV_I(bdev->bd_inode);
+	int ret = 1;
+
+	write_lock(&ei->client_lock);
+	if (ei->client == NULL && ei->client_releasepage == NULL) {
+		ei->client = client;
+		ei->client_releasepage = releasepage;
+	} else if (ei->client != client 
+			|| ei->client_releasepage != releasepage)
+		ret = 0;
+	write_unlock(&ei->client_lock);
+	return ret;
+}
+
+/*
+ * blkdev_unregister_client_releasepage: unregister client_releasepage.
+ */
+void blkdev_unregister_client_releasepage(struct block_device *bdev)
+{
+	struct bdev_inode *ei = BDEV_I(bdev->bd_inode);
+
+	write_lock(&ei->client_lock);
+	ei->client = NULL;
+	ei->client_releasepage = NULL;
+	write_unlock(&ei->client_lock);
+}
+
 static const struct address_space_operations def_blk_aops = {
 	.readpage	= blkdev_readpage,
 	.writepage	= blkdev_writepage,
@@ -1215,6 +1282,7 @@ static const struct address_space_operations def_blk_aops = {
 	.write_begin	= blkdev_write_begin,
 	.write_end	= blkdev_write_end,
 	.writepages	= generic_writepages,
+	.releasepage	= blkdev_releasepage,
 	.direct_IO	= blkdev_direct_IO,
 };
 
diff --git a/fs/super.c b/fs/super.c
index 400a760..fd254eb 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -801,6 +801,18 @@ int get_sb_bdev(struct file_system_type *fs_type,
 
 		s->s_flags |= MS_ACTIVE;
 	}
+	/*
+	 * register a client function which releases a page whose mapping is
+	 * block device 
+	 */
+	if (fs_type->release_metadata != NULL
+		&& !blkdev_register_client_releasepage(bdev, s,
+			fs_type->release_metadata)) {
+		up_write(&s->s_umount);
+		deactivate_super(s);
+		error = -EBUSY;
+		goto error_bdev;
+	}
 
 	return simple_set_mnt(mnt, s);
 
@@ -819,6 +831,16 @@ void kill_block_super(struct super_block *sb)
 	struct block_device *bdev = sb->s_bdev;
 	fmode_t mode = sb->s_mode;
 
+	/*
+	 * unregister a client function which releases a page whose mapping is
+	 * block device
+	 *
+	 * This is sure to be unmounting here, and it releases all own data
+	 * itself. Therefore the filesystem's function which is owned by the
+	 * block device, which releases its data is not needed any more.
+	 */
+	if (sb->s_type->release_metadata != NULL)
+		blkdev_unregister_client_releasepage(bdev);
 	generic_shutdown_super(sb);
 	sync_blockdev(bdev);
 	close_bdev_exclusive(bdev, mode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0dcdd94..398c8ed 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1538,6 +1538,7 @@ struct file_system_type {
 	int (*get_sb) (struct file_system_type *, int,
 		       const char *, void *, struct vfsmount *);
 	void (*kill_sb) (struct super_block *);
+	int (*release_metadata)(void*, struct page*, gfp_t);
 	struct module *owner;
 	struct file_system_type * next;
 	struct list_head fs_supers;
@@ -1699,8 +1700,16 @@ extern void bd_set_size(struct block_device *, loff_t size);
 extern void bd_forget(struct inode *inode);
 extern void bdput(struct block_device *);
 extern struct block_device *open_by_devnum(dev_t, fmode_t);
+extern int blkdev_register_client_releasepage(struct block_device *,
+	void *, int (*releasepage)(void *, struct page*, gfp_t));
+extern void blkdev_unregister_client_releasepage(struct block_device *);
 #else
 static inline void bd_forget(struct inode *inode) {}
+static inline int blkdev_register_client_releasepage(struct block_device *,
+	void *, int (*releasepage)(void *, struct page*, gfp_t))
+{ return 1; }
+static inline void blkdev_unregister_client_releasepage(struct block_device *)
+{}
 #endif
 extern const struct file_operations def_blk_fops;
 extern const struct file_operations def_chr_fops;
-- 
1.6.0.4.8.g36f27.dirty


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH -v3] ext3: provide function to release metadata pages under memory pressure
  2008-12-12 17:52       ` [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems Theodore Ts'o
@ 2008-12-12 17:52         ` Theodore Ts'o
  2008-12-12 17:52           ` [PATCH -v3] ext4: " Theodore Ts'o
  2008-12-17 15:39         ` [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems Jan Kara
  2008-12-26  5:01         ` Al Viro
  2 siblings, 1 reply; 25+ messages in thread
From: Theodore Ts'o @ 2008-12-12 17:52 UTC (permalink / raw)
  To: Ext4 Developers List; +Cc: Toshiyuki Okajima, Theodore Ts'o, linux-fsdevel

From: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>

Pages in the page cache belonging to ext3 data files are released via
the ext3_releasepage() function specified in the ext3 inode's
address_space_ops.  However, metadata blocks (such as indirect blocks,
directory blocks, etc) are managed via the block device
address_space_ops, and they can not be released by
try_to_free_buffers() if they have a journal head attached to them.

To address this, we supply a release_metadata function which is called
by the block device's blkdev_releasepage() function, which calls
journal_try_to_free_buffers() function to free the metadata.

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-fsdevel@vger.kernel.org
---
 fs/ext3/inode.c         |   22 ++++++++++++++++++++++
 fs/ext3/super.c         |    2 ++
 include/linux/ext3_fs.h |    2 ++
 3 files changed, 26 insertions(+), 0 deletions(-)

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index f8424ad..b1fecda 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1680,6 +1680,28 @@ static int ext3_releasepage(struct page *page, gfp_t wait)
 }
 
 /*
+ * Try to release metadata pages (indirect blocks, directories) which are
+ * mapped via the block device.  Since these pages could have journal heads
+ * which would prevent try_to_free_buffers() from freeing them, we must use
+ * jbd layer's try_to_free_buffers() function to release them.
+ */
+int ext3_release_metadata(void *client, struct page *page, gfp_t wait)
+{
+	struct super_block *sb = (struct super_block*)client;
+	journal_t *journal;
+
+	WARN_ON(PageChecked(page));
+	if (!page_has_buffers(page))
+		return 0;
+	BUG_ON(EXT3_SB(sb) == NULL);
+	journal = EXT3_SB(sb)->s_journal;
+	if (journal != NULL)
+		return journal_try_to_free_buffers(journal, page, wait);
+	else
+		return try_to_free_buffers(page);
+}
+
+/*
  * If the O_DIRECT write will extend the file then add this inode to the
  * orphan list.  So recovery will truncate it back to the original size
  * if the machine crashes during the write.
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 541d5e4..4428149 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -2981,6 +2981,8 @@ static struct file_system_type ext3_fs_type = {
 	.name		= "ext3",
 	.get_sb		= ext3_get_sb,
 	.kill_sb	= kill_block_super,
+	.release_metadata 
+			= ext3_release_metadata,
 	.fs_flags	= FS_REQUIRES_DEV,
 };
 
diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
index 9004794..3e3052f 100644
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -867,6 +867,8 @@ extern void ext3_get_inode_flags(struct ext3_inode_info *);
 extern void ext3_set_aops(struct inode *inode);
 extern int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 		       u64 start, u64 len);
+extern int ext3_release_metadata(void *client, struct page *page, 
+			gfp_t wait);
 
 /* ioctl.c */
 extern int ext3_ioctl (struct inode *, struct file *, unsigned int,
-- 
1.6.0.4.8.g36f27.dirty


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH -v3] ext4: provide function to release metadata pages under memory pressure
  2008-12-12 17:52         ` [PATCH -v3] ext3: provide function to release metadata pages under memory pressure Theodore Ts'o
@ 2008-12-12 17:52           ` Theodore Ts'o
  0 siblings, 0 replies; 25+ messages in thread
From: Theodore Ts'o @ 2008-12-12 17:52 UTC (permalink / raw)
  To: Ext4 Developers List; +Cc: Toshiyuki Okajima, Theodore Ts'o, linux-fsdevel

From: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>

Pages in the page cache belonging to ext4 data files are released via
the ext4_releasepage() function specified in the ext4 inode's
address_space_ops.  However, metadata blocks (such as indirect blocks,
directory blocks, etc) are managed via the block device
address_space_ops, and they can not be released by
try_to_free_buffers() if they have a journal head attached to them.

To address this, we supply a release_metadata function which is called
by the block device's blkdev_releasepage() function, which calls
journal_try_to_free_buffers() function to free the metadata.

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-fsdevel@vger.kernel.org
---
 fs/ext4/ext4.h  |    2 ++
 fs/ext4/inode.c |   22 ++++++++++++++++++++++
 fs/ext4/super.c |    4 ++++
 3 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 695b45c..91e06e4 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1099,6 +1099,8 @@ extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks);
 extern int ext4_block_truncate_page(handle_t *handle,
 		struct address_space *mapping, loff_t from);
 extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page);
+extern int ext4_release_metadata(void *client, struct page *page,
+		gfp_t wait);
 
 /* ioctl.c */
 extern long ext4_ioctl(struct file *, unsigned int, unsigned long);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f6d9447..e77a059 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3045,6 +3045,28 @@ static int ext4_releasepage(struct page *page, gfp_t wait)
 }
 
 /*
+ * Try to release metadata pages (indirect blocks, directories) which are
+ * mapped via the block device.  Since these pages could have journal heads
+ * which would prevent try_to_free_buffers() from freeing them, we must use
+ * jbd2 layer's try_to_free_buffers() function to release them.
+ */
+int ext4_release_metadata(void *client, struct page *page, gfp_t wait)
+{
+	struct super_block *sb = (struct super_block*)client;
+	journal_t *journal;
+
+	WARN_ON(PageChecked(page));
+	if (!page_has_buffers(page))
+		return 0;
+	BUG_ON(EXT4_SB(sb) == NULL);
+	journal = EXT4_SB(sb)->s_journal;
+	if (journal != NULL)
+		return jbd2_journal_try_to_free_buffers(journal, page, wait);
+	else
+		return try_to_free_buffers(page);
+}
+
+/*
  * If the O_DIRECT write will extend the file then add this inode to the
  * orphan list.  So recovery will truncate it back to the original size
  * if the machine crashes during the write.
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index bd41fad..f447c46 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3689,6 +3689,8 @@ static struct file_system_type ext4_fs_type = {
 	.name		= "ext4",
 	.get_sb		= ext4_get_sb,
 	.kill_sb	= kill_block_super,
+	.release_metadata
+			= ext4_release_metadata,
 	.fs_flags	= FS_REQUIRES_DEV,
 };
 
@@ -3708,6 +3710,8 @@ static struct file_system_type ext4dev_fs_type = {
 	.name		= "ext4dev",
 	.get_sb		= ext4dev_get_sb,
 	.kill_sb	= kill_block_super,
+	.release_metadata
+			= ext4_release_metadata,
 	.fs_flags	= FS_REQUIRES_DEV,
 };
 MODULE_ALIAS("ext4dev");
-- 
1.6.0.4.8.g36f27.dirty


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [BUG][PATCH 1/4] ext3: fix a cause of __schedule_bug via blkdev_releasepage
  2008-12-12  6:21     ` Theodore Tso
  2008-12-12 17:52       ` [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems Theodore Ts'o
@ 2008-12-15  2:21       ` Toshiyuki Okajima
  1 sibling, 0 replies; 25+ messages in thread
From: Toshiyuki Okajima @ 2008-12-15  2:21 UTC (permalink / raw)
  To: Theodore Tso; +Cc: aneesh.kumar, balbir, linux-ext4

Ted-san,

Theodore Tso wrote:
 > On Fri, Dec 12, 2008 at 09:54:18AM +0900, Toshiyuki Okajima wrote:
 > > > To tell the truth, at first, I imagined the same patch as yours to fix this
 > > > problem. But I have made another patch because I thought that ext3(or ext4)
 > > > should not know the contents of the processing of journal_try_to_free_buffers
 > > >  in detail. (ext3 should not know there is a possibility to call
 > > > journal_wait_for_transaction_sync_data from journal_try_to_free_buffers.)
 >
 > I agree, but ext3 doesn't need to know that.  What my change did was
 > to mask off the _GFP_WAIT flag, which prohibits the function it calls
 > from blocking, because it knows that its caller is holding a spinlock.
 >
 > And actually, come to think of it.  We can do even better; the right
 > fix is to have blkdev_releasepage() mask off the _GFP_WAIT flag; this
 > is the function which is taking spinlock, and by masking off the
 > __GFP_WAIT flag, this is simply requesting all of the downstream
 > functions not to block, but to do the best job they can do without
 > blocking.  It doesn't need to know whether it's going to call
 > log_wait_commit(), or anything else; all it needs to do is request
 > "please don't block".

 > That means we only make the request once, in the function which is
 > taking spinlock, so all of the per-filesystem implementations of
 > release_metadata() don't need to know that its caller is holding a
 > spinlock.

OK. I agree your last change.

I also think blkdev_releasepage must do something so that downstream
functions of it don't sleep. Masking off the _GFP_WAIT flag is the
easiest achievement of it.
Besides, I think it is not valid implementation that brings us a care
about ei->client_releasepage's sleeping.

Additional Information:
I did an easy test with your last change and then I haven't experienced
any errors.

Regards,
Toshiyuki Okajima


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems
  2008-12-12 17:52       ` [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems Theodore Ts'o
  2008-12-12 17:52         ` [PATCH -v3] ext3: provide function to release metadata pages under memory pressure Theodore Ts'o
@ 2008-12-17 15:39         ` Jan Kara
  2008-12-18  5:15           ` Toshiyuki Okajima
  2008-12-26  5:01         ` Al Viro
  2 siblings, 1 reply; 25+ messages in thread
From: Jan Kara @ 2008-12-17 15:39 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Ext4 Developers List, Toshiyuki Okajima, linux-fsdevel

  Hello,

> From: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
> 
> Implement blkdev_releasepage() to release the buffer_heads and page
> after we release private data which belongs to a client of the block
> device, such as a filesystem.
> 
> blkdev_releasepage() call the client's releasepage() which is
> registered by blkdev_register_client_releasepage() to release its
> private data.
  Yes, this is IMO the right fix. I'm just wondering about the fact that we
can't block in the client_releasepage(). That seems to be caused by the fact
that we need to be protected against client_releasepage() callback changes
which essentially means umount, right? I'm not saying I have a better solution
but introducing such limitation seems stupid just because of umount...

									Honza

> Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> Cc: linux-fsdevel@vger.kernel.org
> ---
>  fs/block_dev.c     |   68 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/super.c         |   22 ++++++++++++++++
>  include/linux/fs.h |    9 +++++++
>  3 files changed, 99 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index db831ef..bac0a38 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -29,6 +29,9 @@
>  
>  struct bdev_inode {
>  	struct block_device bdev;
> +	void *client;
> +	int (*client_releasepage)(void*, struct page*, gfp_t);
> +	rwlock_t client_lock;
>  	struct inode vfs_inode;
>  };
>  
> @@ -260,6 +263,9 @@ static struct inode *bdev_alloc_inode(struct super_block *sb)
>  	struct bdev_inode *ei = kmem_cache_alloc(bdev_cachep, GFP_KERNEL);
>  	if (!ei)
>  		return NULL;
> +	ei->client = NULL;
> +	ei->client_releasepage = NULL;
> +	rwlock_init(&ei->client_lock);
>  	return &ei->vfs_inode;
>  }
>  
> @@ -1208,6 +1214,67 @@ static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg)
>  	return blkdev_ioctl(bdev, mode, cmd, arg);
>  }
>  
> +/*
> + * blkdev_releasepage: execute ei->client_releasepage() if it exists.
> + * Otherwise, execute try_to_free_buffers().
> + * ei->client_releasepage() releases private client's page if possible.
> + * Because a buffer_head's using counter is bigger than 0 if a client has
> + * a page for private usage. If so, try_to_free_buffers() cannot release it.
> + * Therefore a client must try to release a page itself.
> + */
> +static int blkdev_releasepage(struct page *page, gfp_t wait)
> +{
> +	struct bdev_inode *ei = BDEV_I(page->mapping->host);
> +	int ret;
> +
> +	read_lock(&ei->client_lock);
> +	if (ei->client_releasepage != NULL)
> +		/*
> +		 * Since we are holding a spinlock (ei->client_lock),
> +		 * make sure the client_releasepage function
> +		 * understands that it must not block.
> +		 */
> +		ret = (*ei->client_releasepage)(ei->client, page,
> +						wait & ~__GFP_WAIT);
> +	else
> +		ret = try_to_free_buffers(page);
> +	read_unlock(&ei->client_lock);
> +	return ret;
> +}
> +
> +/*
> + * blkdev_register_client_releasepage: register client_releasepage.
> + */
> +int blkdev_register_client_releasepage(struct block_device *bdev,
> +	void *client, int (*releasepage)(void*, struct page*, gfp_t))
> +{
> +	struct bdev_inode *ei = BDEV_I(bdev->bd_inode);
> +	int ret = 1;
> +
> +	write_lock(&ei->client_lock);
> +	if (ei->client == NULL && ei->client_releasepage == NULL) {
> +		ei->client = client;
> +		ei->client_releasepage = releasepage;
> +	} else if (ei->client != client 
> +			|| ei->client_releasepage != releasepage)
> +		ret = 0;
> +	write_unlock(&ei->client_lock);
> +	return ret;
> +}
> +
> +/*
> + * blkdev_unregister_client_releasepage: unregister client_releasepage.
> + */
> +void blkdev_unregister_client_releasepage(struct block_device *bdev)
> +{
> +	struct bdev_inode *ei = BDEV_I(bdev->bd_inode);
> +
> +	write_lock(&ei->client_lock);
> +	ei->client = NULL;
> +	ei->client_releasepage = NULL;
> +	write_unlock(&ei->client_lock);
> +}
> +
>  static const struct address_space_operations def_blk_aops = {
>  	.readpage	= blkdev_readpage,
>  	.writepage	= blkdev_writepage,
> @@ -1215,6 +1282,7 @@ static const struct address_space_operations def_blk_aops = {
>  	.write_begin	= blkdev_write_begin,
>  	.write_end	= blkdev_write_end,
>  	.writepages	= generic_writepages,
> +	.releasepage	= blkdev_releasepage,
>  	.direct_IO	= blkdev_direct_IO,
>  };
>  
> diff --git a/fs/super.c b/fs/super.c
> index 400a760..fd254eb 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -801,6 +801,18 @@ int get_sb_bdev(struct file_system_type *fs_type,
>  
>  		s->s_flags |= MS_ACTIVE;
>  	}
> +	/*
> +	 * register a client function which releases a page whose mapping is
> +	 * block device 
> +	 */
> +	if (fs_type->release_metadata != NULL
> +		&& !blkdev_register_client_releasepage(bdev, s,
> +			fs_type->release_metadata)) {
> +		up_write(&s->s_umount);
> +		deactivate_super(s);
> +		error = -EBUSY;
> +		goto error_bdev;
> +	}
>  
>  	return simple_set_mnt(mnt, s);
>  
> @@ -819,6 +831,16 @@ void kill_block_super(struct super_block *sb)
>  	struct block_device *bdev = sb->s_bdev;
>  	fmode_t mode = sb->s_mode;
>  
> +	/*
> +	 * unregister a client function which releases a page whose mapping is
> +	 * block device
> +	 *
> +	 * This is sure to be unmounting here, and it releases all own data
> +	 * itself. Therefore the filesystem's function which is owned by the
> +	 * block device, which releases its data is not needed any more.
> +	 */
> +	if (sb->s_type->release_metadata != NULL)
> +		blkdev_unregister_client_releasepage(bdev);
>  	generic_shutdown_super(sb);
>  	sync_blockdev(bdev);
>  	close_bdev_exclusive(bdev, mode);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 0dcdd94..398c8ed 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1538,6 +1538,7 @@ struct file_system_type {
>  	int (*get_sb) (struct file_system_type *, int,
>  		       const char *, void *, struct vfsmount *);
>  	void (*kill_sb) (struct super_block *);
> +	int (*release_metadata)(void*, struct page*, gfp_t);
>  	struct module *owner;
>  	struct file_system_type * next;
>  	struct list_head fs_supers;
> @@ -1699,8 +1700,16 @@ extern void bd_set_size(struct block_device *, loff_t size);
>  extern void bd_forget(struct inode *inode);
>  extern void bdput(struct block_device *);
>  extern struct block_device *open_by_devnum(dev_t, fmode_t);
> +extern int blkdev_register_client_releasepage(struct block_device *,
> +	void *, int (*releasepage)(void *, struct page*, gfp_t));
> +extern void blkdev_unregister_client_releasepage(struct block_device *);
>  #else
>  static inline void bd_forget(struct inode *inode) {}
> +static inline int blkdev_register_client_releasepage(struct block_device *,
> +	void *, int (*releasepage)(void *, struct page*, gfp_t))
> +{ return 1; }
> +static inline void blkdev_unregister_client_releasepage(struct block_device *)
> +{}
>  #endif
>  extern const struct file_operations def_blk_fops;
>  extern const struct file_operations def_chr_fops;
> -- 
> 1.6.0.4.8.g36f27.dirty
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems
  2008-12-17 15:39         ` [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems Jan Kara
@ 2008-12-18  5:15           ` Toshiyuki Okajima
  2008-12-18 13:12             ` Jan Kara
  0 siblings, 1 reply; 25+ messages in thread
From: Toshiyuki Okajima @ 2008-12-18  5:15 UTC (permalink / raw)
  To: Jan Kara; +Cc: Theodore Ts'o, Ext4 Developers List, linux-fsdevel

Hi,
 >   Hello,
 >
 > > > From: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
 > > >
 > > > Implement blkdev_releasepage() to release the buffer_heads and page
 > > > after we release private data which belongs to a client of the block
 > > > device, such as a filesystem.
 > > >
 > > > blkdev_releasepage() call the client's releasepage() which is
 > > > registered by blkdev_register_client_releasepage() to release its
 > > > private data.
 >   Yes, this is IMO the right fix. I'm just wondering about the fact that we
 > can't block in the client_releasepage(). That seems to be caused by the fact
 > that we need to be protected against client_releasepage() callback changes
 > which essentially means umount, right? I'm not saying I have a better solution
 > but introducing such limitation seems stupid just because of umount...
 >
 > 									Honza
 >

Difference between v2 and v3 in blkdev_releasepage:
<		ret = (*ei->client_releasepage)(ei->client, page, wait);
<	else
--
 >		/*
 >		 * Since we are holding a spinlock (ei->client_lock),
 >		 * make sure the client_releasepage function
 >		 * understands that it must not block.
 >		 */
 >		ret = (*ei->client_releasepage)(ei->client, page,
 >						wait & ~__GFP_WAIT);
 >	else

Ask for clarification.

Which of the following do you mean:
1) If using a spinlock in client_releasepage() is only for mount/umount,
  this implementation is not wise.
2) There is the fact that a spinlock is necessary for blkdev_releasepage().
This fact prevents us from making various implementations of
client_releasepage().
(Without a spinlock, we can implement a client_releasepage() which can release
the buffers with a sleep. As a result, it may enable more buffers release than
before.)

There is the fact that a filesystem can be mounted on several places,
and the lock mechanism is absolutely necessary for this fact.

I also think we are sad that we cannot implement various implementations for
client_releasepage(). But now I cannot imagine what to do for
a client_releasepage() which can sleep, too...

Regards,
Toshiyuki Okajima

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems
  2008-12-18  5:15           ` Toshiyuki Okajima
@ 2008-12-18 13:12             ` Jan Kara
  2008-12-18 14:54               ` Theodore Tso
  2008-12-19  5:15               ` Toshiyuki Okajima
  0 siblings, 2 replies; 25+ messages in thread
From: Jan Kara @ 2008-12-18 13:12 UTC (permalink / raw)
  To: Toshiyuki Okajima; +Cc: Theodore Ts'o, Ext4 Developers List, linux-fsdevel

  Hello,

On Thu 18-12-08 14:15:25, Toshiyuki Okajima wrote:
> > > > From: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
> > > >
> > > > Implement blkdev_releasepage() to release the buffer_heads and page
> > > > after we release private data which belongs to a client of the block
> > > > device, such as a filesystem.
> > > >
> > > > blkdev_releasepage() call the client's releasepage() which is
> > > > registered by blkdev_register_client_releasepage() to release its
> > > > private data.
> >   Yes, this is IMO the right fix. I'm just wondering about the fact that we
> > can't block in the client_releasepage(). That seems to be caused by the fact
> > that we need to be protected against client_releasepage() callback changes
> > which essentially means umount, right? I'm not saying I have a better solution
> > but introducing such limitation seems stupid just because of umount...
> >
> Difference between v2 and v3 in blkdev_releasepage:
> <		ret = (*ei->client_releasepage)(ei->client, page, wait);
> <	else
> --
> >		/*
> >		 * Since we are holding a spinlock (ei->client_lock),
> >		 * make sure the client_releasepage function
> >		 * understands that it must not block.
> >		 */
> >		ret = (*ei->client_releasepage)(ei->client, page,
> >						wait & ~__GFP_WAIT);
> >	else
>
> Ask for clarification.
  Yes, my question was more about the original design of the patch than
about the particular fix. Sorry for the confusion.

> Which of the following do you mean:
> 1) If using a spinlock in client_releasepage() is only for mount/umount,
>  this implementation is not wise.
> 2) There is the fact that a spinlock is necessary for blkdev_releasepage().
> This fact prevents us from making various implementations of
> client_releasepage().
> (Without a spinlock, we can implement a client_releasepage() which can release
> the buffers with a sleep. As a result, it may enable more buffers release than
> before.)
>
> There is the fact that a filesystem can be mounted on several places,
> and the lock mechanism is absolutely necessary for this fact.
  This is the thing I was wondering about. Why exactly is the spinlock
necessary for blkdev_releasepage()? I understand we have to protect
reading client_releasepage() pointer because it could change but my point
was that it changes only during mount / umount.

> I also think we are sad that we cannot implement various implementations for
> client_releasepage(). But now I cannot imagine what to do for
> a client_releasepage() which can sleep, too...

							Regards
								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems
  2008-12-18 13:12             ` Jan Kara
@ 2008-12-18 14:54               ` Theodore Tso
  2008-12-18 16:38                 ` Jan Kara
  2008-12-19  5:15               ` Toshiyuki Okajima
  1 sibling, 1 reply; 25+ messages in thread
From: Theodore Tso @ 2008-12-18 14:54 UTC (permalink / raw)
  To: Jan Kara; +Cc: Toshiyuki Okajima, Ext4 Developers List, linux-fsdevel

On Thu, Dec 18, 2008 at 02:12:34PM +0100, Jan Kara wrote:
>   This is the thing I was wondering about. Why exactly is the spinlock
> necessary for blkdev_releasepage()? I understand we have to protect
> reading client_releasepage() pointer because it could change but my point
> was that it changes only during mount / umount.

Hmm.... I suppose we could use RCU, but then we'd have to worry about
the race condition where client_releasepage() gets called after the
umount has happened.

> > I also think we are sad that we cannot implement various
> > implementations for client_releasepage(). But now I cannot imagine
> > what to do for a client_releasepage() which can sleep, too...

My suggestion is that we not worry about making changes to
fs/block_dev.c to allow client_releasepage() to sleep until we have
filesystems that really need client_releasepage() to sleep.  It
probably is possible, with appropriate atomic bit sets for flags to
indicate an unmount in progress, and client_releasepage in progress,
and use of RCU, we could allow client_releasepage.  But it might not
be worth it unless there is a filesystem that really needs it.

   	    	   	      		      	  - Ted

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems
  2008-12-18 14:54               ` Theodore Tso
@ 2008-12-18 16:38                 ` Jan Kara
  0 siblings, 0 replies; 25+ messages in thread
From: Jan Kara @ 2008-12-18 16:38 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Toshiyuki Okajima, Ext4 Developers List, linux-fsdevel

On Thu 18-12-08 09:54:00, Theodore Tso wrote:
> On Thu, Dec 18, 2008 at 02:12:34PM +0100, Jan Kara wrote:
> >   This is the thing I was wondering about. Why exactly is the spinlock
> > necessary for blkdev_releasepage()? I understand we have to protect
> > reading client_releasepage() pointer because it could change but my point
> > was that it changes only during mount / umount.
> 
> Hmm.... I suppose we could use RCU, but then we'd have to worry about
> the race condition where client_releasepage() gets called after the
> umount has happened.
  Yes. Actually, I'm not so much against spinlock for obtaining the
function pointer but you needn't hold it when you actually call the
function. Except that you have to care about those umount races, I agree.

> > > I also think we are sad that we cannot implement various
> > > implementations for client_releasepage(). But now I cannot imagine
> > > what to do for a client_releasepage() which can sleep, too...
> 
> My suggestion is that we not worry about making changes to
> fs/block_dev.c to allow client_releasepage() to sleep until we have
> filesystems that really need client_releasepage() to sleep.  It
> probably is possible, with appropriate atomic bit sets for flags to
> indicate an unmount in progress, and client_releasepage in progress,
> and use of RCU, we could allow client_releasepage.  But it might not
> be worth it unless there is a filesystem that really needs it.
  OK. Actually, there is a possibility when ext3/4 might want wait - if
someone does direct IO on the block device, it might fail with EIO because
the page is pinned by JBD and we have to wait for it. But OTOH doing dio
on a block device with a mounted filesystem is *really* a stupid thing to do
so I agree we don't care.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems
  2008-12-18 13:12             ` Jan Kara
  2008-12-18 14:54               ` Theodore Tso
@ 2008-12-19  5:15               ` Toshiyuki Okajima
  1 sibling, 0 replies; 25+ messages in thread
From: Toshiyuki Okajima @ 2008-12-19  5:15 UTC (permalink / raw)
  To: Jan Kara; +Cc: Theodore Ts'o, Ext4 Developers List, linux-fsdevel

Hello.

Jan Kara wrote:
 >   Hello,
 >
<SNIP>
 > > Which of the following do you mean:
 > > 1) If using a spinlock in client_releasepage() is only for mount/umount,
 > >  this implementation is not wise.
 > > 2) There is the fact that a spinlock is necessary for blkdev_releasepage().
 > > This fact prevents us from making various implementations of
 > > client_releasepage().
 > > (Without a spinlock, we can implement a client_releasepage() which can release
 > > the buffers with a sleep. As a result, it may enable more buffers release than
 > > before.)
 > >
 > > There is the fact that a filesystem can be mounted on several places,
 > > and the lock mechanism is absolutely necessary for this fact.
 >   This is the thing I was wondering about. Why exactly is the spinlock
 > necessary for blkdev_releasepage()? I understand we have to protect
 > reading client_releasepage() pointer because it could change but my point
 > was that it changes only during mount / umount.

There are 2 purposes of this lock.
1) The race between filesystem's mount and umount.
(So that a filesystem can be mounted on several places concurrently.)
------------------------------------------------------------------
Without this lock, there is a possibility that the pointer of
ei->client_releasepage becomes NULL by umount.
As a result, a special releasepage for its filesystem is not used even if its
filesystem has been mounted.
------------------------------------------------------------------

2) The race between the usage of blkdev_releasepage() and umount.
------------------------------------------------------------------
Without this lock, there is a possibility that the pointer of
ei->client_releasepage becomes NULL by umount.
As a result, the process which calls blkdev_releasepage() may experience a page
  fault. Because blkdev_releasepage() refers the value ei->client_releasepage
and then calls it as a function.

But even if the pointer is not NULL, there is a possibility that a filesystem
which has it has been unmounted. Besides, there is a possibility that the
module of the filesystem has been unloaded. In this case, something wrong
can happen.
(Example: While a filesystem is being unmounted, one of its resources can be
touched by using the ei->client_releasepage of the filesystem by
the side of calling blkdev_releasepage.)
------------------------------------------------------------------

Therefore some lock mechanisms are necessary to solve the races.

Regards,
Toshiyuki Okajima


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems
  2008-12-12 17:52       ` [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems Theodore Ts'o
  2008-12-12 17:52         ` [PATCH -v3] ext3: provide function to release metadata pages under memory pressure Theodore Ts'o
  2008-12-17 15:39         ` [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems Jan Kara
@ 2008-12-26  5:01         ` Al Viro
  2009-01-03 15:09           ` Theodore Ts'o
  2 siblings, 1 reply; 25+ messages in thread
From: Al Viro @ 2008-12-26  5:01 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Ext4 Developers List, Toshiyuki Okajima, linux-fsdevel

On Fri, Dec 12, 2008 at 12:52:53PM -0500, Theodore Ts'o wrote:
> From: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
> 
> Implement blkdev_releasepage() to release the buffer_heads and page
> after we release private data which belongs to a client of the block
> device, such as a filesystem.
> 
> blkdev_releasepage() call the client's releasepage() which is
> registered by blkdev_register_client_releasepage() to release its
> private data.
> 
> Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> Cc: linux-fsdevel@vger.kernel.org
Entire-pile-NAKed-by: Al Viro <viro@zeniv.linux.org.uk>

a) Use of fs_type to hide a callback is wrong.  Pass it explicitly to
whatever helper you want, don't do that crapin get_sb_bdev()

b) the same goes from unregistration

c) you are unregistering your callback before doing generic_shutdown_super().
Odd, seeing that a _lot_ of fs operations can happen at that point.

d) we *already* have exclusion mechanism.  It's called open_bdev_exclusive()
and it is used by get_sb_bdev() and friends already.  Don't reinvent the
wheel, please.

e) what's going on with the locking there?  Comments about mount/umount races
are absolutely bogus - we *have* serialization for fs shutdown/startup.  And
if we hadn't, we would have far worse problems with races than that.  The
other kind of race is possible, but... this interface is asking for trouble.
It sounds like a way to attach some data structures of your own to page and
rely on that callback for freeing them.  But as soon as somebody tries that
we'll have a problem; page can outlive the unregistration of callback and
we'll get a leak (in the best case).  Sure, ext3 and ext4 won't step into
it (journal shutdown will deal with that), but it's a trap for unaware.
At the very least it needs to be commented.

Said that, I still don't like the use of rwlock here ;-/  If nothing else,
that calls for rcu - fs shutdown is extremely rare compared to releasepage...

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems
  2008-12-26  5:01         ` Al Viro
@ 2009-01-03 15:09           ` Theodore Ts'o
  2009-01-03 15:09             ` [PATCH 1/3] add releasepage " Theodore Ts'o
  0 siblings, 1 reply; 25+ messages in thread
From: Theodore Ts'o @ 2009-01-03 15:09 UTC (permalink / raw)
  To: Al Viro; +Cc: Ext4 Developers List

Al,

Thanks for your comments; they were quite useful.  I've reworked
Okajima-san's patches to address your concerns.  What do you think?

	      	      	 	      		 - Ted



^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/3] add releasepage hooks to block devices which can be used by file systems
  2009-01-03 15:09           ` Theodore Ts'o
@ 2009-01-03 15:09             ` Theodore Ts'o
  2009-01-03 15:09               ` [PATCH 2/3] ext3: provide function to release metadata pages under memory pressure Theodore Ts'o
  2009-01-05  8:16               ` [PATCH 1/3] add releasepage hooks to block devices which can be used by file systems Toshiyuki Okajima
  0 siblings, 2 replies; 25+ messages in thread
From: Theodore Ts'o @ 2009-01-03 15:09 UTC (permalink / raw)
  To: Al Viro
  Cc: Ext4 Developers List, Theodore Ts'o, Toshiyuki Okajima,
	linux-fsdevel

Implement blkdev_releasepage() to release the buffer_heads and pages
after we release private data belonging to a mounted filesystem.

Cc: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 fs/block_dev.c     |   15 +++++++++++++++
 fs/super.c         |    2 ++
 include/linux/fs.h |    2 ++
 3 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 99e0ae1..ef7d795 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1219,6 +1219,20 @@ static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 	return blkdev_ioctl(bdev, mode, cmd, arg);
 }
 
+/*
+ * Try to release a page associated with block device when the system
+ * is under memory pressure.
+ */
+static int blkdev_releasepage(struct page *page, gfp_t wait)
+{
+	struct super_block *super = BDEV_I(page->mapping->host)->bdev.bd_super;
+
+	if (super && super->s_op->bdev_try_to_free_page)
+		return super->s_op->bdev_try_to_free_page(super, page, wait);
+
+	return try_to_free_buffers(page);
+}
+
 static const struct address_space_operations def_blk_aops = {
 	.readpage	= blkdev_readpage,
 	.writepage	= blkdev_writepage,
@@ -1226,6 +1240,7 @@ static const struct address_space_operations def_blk_aops = {
 	.write_begin	= blkdev_write_begin,
 	.write_end	= blkdev_write_end,
 	.writepages	= generic_writepages,
+	.releasepage	= blkdev_releasepage,
 	.direct_IO	= blkdev_direct_IO,
 };
 
diff --git a/fs/super.c b/fs/super.c
index 400a760..d7e200d 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -800,6 +800,7 @@ int get_sb_bdev(struct file_system_type *fs_type,
 		}
 
 		s->s_flags |= MS_ACTIVE;
+		bdev->bd_super = s;
 	}
 
 	return simple_set_mnt(mnt, s);
@@ -819,6 +820,7 @@ void kill_block_super(struct super_block *sb)
 	struct block_device *bdev = sb->s_bdev;
 	fmode_t mode = sb->s_mode;
 
+	bdev->bd_super = 0;
 	generic_shutdown_super(sb);
 	sync_blockdev(bdev);
 	close_bdev_exclusive(bdev, mode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4a853ef..911f812 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -553,6 +553,7 @@ struct address_space {
 struct block_device {
 	dev_t			bd_dev;  /* not a kdev_t - it's a search key */
 	struct inode *		bd_inode;	/* will die */
+	struct super_block *	bd_super;
 	int			bd_openers;
 	struct mutex		bd_mutex;	/* open/close mutex */
 	struct semaphore	bd_mount_sem;
@@ -1375,6 +1376,7 @@ struct super_operations {
 	ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
 	ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
 #endif
+	int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
 };
 
 /*
-- 
1.6.0.4.8.g36f27.dirty


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 2/3] ext3: provide function to release metadata pages under memory pressure
  2009-01-03 15:09             ` [PATCH 1/3] add releasepage " Theodore Ts'o
@ 2009-01-03 15:09               ` Theodore Ts'o
  2009-01-03 15:09                 ` [PATCH 3/3] ext4: " Theodore Ts'o
  2009-01-05  8:16               ` [PATCH 1/3] add releasepage hooks to block devices which can be used by file systems Toshiyuki Okajima
  1 sibling, 1 reply; 25+ messages in thread
From: Theodore Ts'o @ 2009-01-03 15:09 UTC (permalink / raw)
  To: Al Viro
  Cc: Ext4 Developers List, Toshiyuki Okajima, Theodore Ts'o,
	linux-fsdevel

From: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>

Pages in the page cache belonging to ext3 data files are released via
the ext3_releasepage() function specified in the ext3 inode's
address_space_ops.  However, metadata blocks (such as indirect blocks,
directory blocks, etc) are managed via the block device
address_space_ops, and they can not be released by
try_to_free_buffers() if they have a journal head attached to them.

To address this, we supply a try_to_free_pages() function which calls
journal_try_to_free_buffers() function to free the metadata, and which
is called by the block device's blkdev_releasepage() function.

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-fsdevel@vger.kernel.org
---
 fs/ext3/super.c |   20 ++++++++++++++++++++
 1 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 541d5e4..93655a6 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -682,6 +682,25 @@ static struct dentry *ext3_fh_to_parent(struct super_block *sb, struct fid *fid,
 				    ext3_nfs_get_inode);
 }
 
+/*
+ * Try to release metadata pages (indirect blocks, directories) which are
+ * mapped via the block device.  Since these pages could have journal heads
+ * which would prevent try_to_free_buffers() from freeing them, we must use
+ * jbd layer's try_to_free_buffers() function to release them.
+ */
+static int bdev_try_to_free_page(struct super_block *sb, struct page *page,
+				 gfp_t wait)
+{
+	journal_t *journal = EXT3_SB(sb)->s_journal;
+
+	WARN_ON(PageChecked(page));
+	if (!page_has_buffers(page))
+		return 0;
+	if (journal)
+		return journal_try_to_free_buffers(journal, page, wait);
+	return try_to_free_buffers(page);
+}
+
 #ifdef CONFIG_QUOTA
 #define QTYPE2NAME(t) ((t)==USRQUOTA?"user":"group")
 #define QTYPE2MOPT(on, t) ((t)==USRQUOTA?((on)##USRJQUOTA):((on)##GRPJQUOTA))
@@ -746,6 +765,7 @@ static const struct super_operations ext3_sops = {
 	.quota_read	= ext3_quota_read,
 	.quota_write	= ext3_quota_write,
 #endif
+	.bdev_try_to_free_page = bdev_try_to_free_page,
 };
 
 static const struct export_operations ext3_export_ops = {
-- 
1.6.0.4.8.g36f27.dirty


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 3/3] ext4: provide function to release metadata pages under memory pressure
  2009-01-03 15:09               ` [PATCH 2/3] ext3: provide function to release metadata pages under memory pressure Theodore Ts'o
@ 2009-01-03 15:09                 ` Theodore Ts'o
  0 siblings, 0 replies; 25+ messages in thread
From: Theodore Ts'o @ 2009-01-03 15:09 UTC (permalink / raw)
  To: Al Viro
  Cc: Ext4 Developers List, Toshiyuki Okajima, Theodore Ts'o,
	linux-fsdevel

From: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>

Pages in the page cache belonging to ext4 data files are released via
the ext4_releasepage() function specified in the ext4 inode's
address_space_ops.  However, metadata blocks (such as indirect blocks,
directory blocks, etc) are managed via the block device
address_space_ops, and they can not be released by
try_to_free_buffers() if they have a journal head attached to them.

To address this, we supply a release_metadata function which calls
jbd2_journal_try_to_free_buffers() function to free the metadata, and
which is called by the block device's blkdev_releasepage() function.

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-fsdevel@vger.kernel.org
---
 fs/ext4/super.c |   19 +++++++++++++++++++
 1 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 0e0653d..980391d 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -907,6 +907,24 @@ static struct dentry *ext4_fh_to_parent(struct super_block *sb, struct fid *fid,
 				    ext4_nfs_get_inode);
 }
 
+/*
+ * Try to release metadata pages (indirect blocks, directories) which are
+ * mapped via the block device.  Since these pages could have journal heads
+ * which would prevent try_to_free_buffers() from freeing them, we must use
+ * jbd2 layer's try_to_free_buffers() function to release them.
+ */
+static int bdev_try_to_free_page(struct super_block *sb, struct page *page, gfp_t wait)
+{
+	journal_t *journal = EXT4_SB(sb)->s_journal;
+
+	WARN_ON(PageChecked(page));
+	if (!page_has_buffers(page))
+		return 0;
+	if (journal)
+		return jbd2_journal_try_to_free_buffers(journal, page, wait);
+	return try_to_free_buffers(page);
+}
+
 #ifdef CONFIG_QUOTA
 #define QTYPE2NAME(t) ((t) == USRQUOTA ? "user" : "group")
 #define QTYPE2MOPT(on, t) ((t) == USRQUOTA?((on)##USRJQUOTA):((on)##GRPJQUOTA))
@@ -971,6 +989,7 @@ static const struct super_operations ext4_sops = {
 	.quota_read	= ext4_quota_read,
 	.quota_write	= ext4_quota_write,
 #endif
+	.bdev_try_to_free_page = bdev_try_to_free_page,
 };
 
 static const struct export_operations ext4_export_ops = {
-- 
1.6.0.4.8.g36f27.dirty


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/3] add releasepage hooks to block devices which can be used by file systems
  2009-01-03 15:09             ` [PATCH 1/3] add releasepage " Theodore Ts'o
  2009-01-03 15:09               ` [PATCH 2/3] ext3: provide function to release metadata pages under memory pressure Theodore Ts'o
@ 2009-01-05  8:16               ` Toshiyuki Okajima
  2009-01-05 16:05                 ` Theodore Tso
  1 sibling, 1 reply; 25+ messages in thread
From: Toshiyuki Okajima @ 2009-01-05  8:16 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Al Viro, Ext4 Developers List, linux-fsdevel

Hi Ted-san,

Theodore Ts'o wrote:
> Implement blkdev_releasepage() to release the buffer_heads and pages
> after we release private data belonging to a mounted filesystem.
>
> Cc: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
> Cc: linux-fsdevel@vger.kernel.org
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> ---
>  fs/block_dev.c     |   15 +++++++++++++++
>  fs/super.c         |    2 ++
>  include/linux/fs.h |    2 ++
>  3 files changed, 19 insertions(+), 0 deletions(-)

I was confirming whether the kernel to which your new patch is applied can run
without trouble. But unfortunately, I got a hangup problem. Now I am
investigating the root cause.
After I investigated it for a little time, I think calling log_wait_commit() from
journal_try_to_free_buffers() can cause it.
I examine it a little more in detail.

Additional Info(Crash dump):
Backtrace:
crash> bt
PID: 260    TASK: f71076d0  CPU: 1   COMMAND: "kswapd0"
 #0 [f707dcbc] schedule at c06346a3
 #1 [f707dd34] log_wait_commit at f80904c1
 #2 [f707dd70] journal_try_to_free_buffers at f808c81f
 #3 [f707dd94] blkdev_releasepage at c04916cc
 #4 [f707dda4] try_to_release_page at c04526b1
 #5 [f707ddb0] shrink_page_list at c045b3d1
 #6 [f707de50] shrink_list at c045b72e
 #7 [f707def0] shrink_zone at c045bbc6
 #8 [f707df40] kswapd at c045c12c
 #9 [f707dfd8] kthread at c043612c
#10 [f707dfe4] kernel_thread_helper at c04045e1

Sleep time:
crash> ps -l | head -n 1
[5360808577593]  PID: 7995   TASK: c98b76d0  CPU: 1   COMMAND: "crtfile"
crash> ps -l 260
[3727586943566]  PID: 260    TASK: f71076d0  CPU: 1   COMMAND: "kswapd0"
crash> p (5360808577593 - 3727586943566)/1000000000
$4 = 1633
 ======> 1633 seconds

Best Regards,
Toshiyuki Okajima


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/3] add releasepage hooks to block devices which can be used by file systems
  2009-01-05  8:16               ` [PATCH 1/3] add releasepage hooks to block devices which can be used by file systems Toshiyuki Okajima
@ 2009-01-05 16:05                 ` Theodore Tso
  2009-01-06  4:07                   ` Toshiyuki Okajima
  0 siblings, 1 reply; 25+ messages in thread
From: Theodore Tso @ 2009-01-05 16:05 UTC (permalink / raw)
  To: Toshiyuki Okajima; +Cc: Al Viro, Ext4 Developers List, linux-fsdevel

On Mon, Jan 05, 2009 at 05:16:08PM +0900, Toshiyuki Okajima wrote:
> 
> I was confirming whether the kernel to which your new patch is
> applied can run without trouble. But unfortunately, I got a hangup
> problem. Now I am investigating the root cause.  After I
> investigated it for a little time, I think calling log_wait_commit()
> from journal_try_to_free_buffers() can cause it.  

Sounds like a deadlock caused by the fact that we're no longer masking
__GFP_WAIT, probably on journal->j_wait_done_commit.  Presumably the
system came under pressure during a commit operation, which makes
sense, and so we ended up with a deadlock between kjournald and
kswapd.  The fix is pretty simple; we just need to mask out the
__GFP_WAIT in the filesystem-specific callback, since this is a
restriction imposed by the filesystem's use of the jbd/jbd2 layer.

	    	       	   		       - Ted

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/3] add releasepage hooks to block devices which can be used by file systems
  2009-01-05 16:05                 ` Theodore Tso
@ 2009-01-06  4:07                   ` Toshiyuki Okajima
  2009-01-06  4:29                     ` Theodore Tso
  0 siblings, 1 reply; 25+ messages in thread
From: Toshiyuki Okajima @ 2009-01-06  4:07 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Al Viro, Ext4 Developers List, linux-fsdevel

Ted-san,

Theodore Tso wrote:
 > On Mon, Jan 05, 2009 at 05:16:08PM +0900, Toshiyuki Okajima wrote:
 > > >
 > > > I was confirming whether the kernel to which your new patch is
 > > > applied can run without trouble. But unfortunately, I got a hangup
 > > > problem. Now I am investigating the root cause.  After I
 > > > investigated it for a little time, I think calling log_wait_commit()
 > > > from journal_try_to_free_buffers() can cause it.
 >
 > Sounds like a deadlock caused by the fact that we're no longer masking
 > __GFP_WAIT, probably on journal->j_wait_done_commit.  Presumably the
 > system came under pressure during a commit operation, which makes
 > sense, and so we ended up with a deadlock between kjournald and
 > kswapd.  The fix is pretty simple; we just need to mask out the
 > __GFP_WAIT in the filesystem-specific callback, since this is a
 > restriction imposed by the filesystem's use of the jbd/jbd2 layer.

Your opinion is correct.
A detailed investigation is done, and the root cause has been understood.

The deadlock was caused by the following two processes:

(1) A certain process
Memory collecting process which is started by a memory allocator calls
journal_try_to_free_buffers(). And then it calls log_wait_commit() to get more
memory and waits for the finish of one committing transaction.

(2) kjournald process
kjournald process starts by Process (1) calling log_wait_commit().
And then it calls journal_commit_transaction to write all data buffers
into the filesystem and write all metadata buffers into the journal storage.
Writing metadata buffer is journal_write_metadata_buffer(). This function also needs
new buffer_head (more memory) in order to copy a buffer_head.

Detailed Information:
Process (1):
crash> bt 260
PID: 260    TASK: f71076d0  CPU: 1   COMMAND: "kswapd0"
  #0 [f707dcbc] schedule at c06346a3
  #1 [f707dd34] log_wait_commit at f80904c1
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -> It lets kjournald start
                                                and waits for the commit.
  #2 [f707dd70] journal_try_to_free_buffers at f808c81f
  #3 [f707dd94] blkdev_releasepage at c04916cc
  #4 [f707dda4] try_to_release_page at c04526b1
  #5 [f707ddb0] shrink_page_list at c045b3d1
  #6 [f707de50] shrink_list at c045b72e
  #7 [f707def0] shrink_zone at c045bbc6
  #8 [f707df40] kswapd at c045c12c
  #9 [f707dfd8] kthread at c043612c
#10 [f707dfe4] kernel_thread_helper at c04045e1

journal structure: 0xccab1e00

Process (2) [kjournald]:
PID: 3170   TASK: f717b240  CPU: 1   COMMAND: "kjournald"
  #0 [c42b4cf4] schedule at c06346a3
  #1 [c42b4d6c] schedule_timeout at c06349ef
  #2 [c42b4d90] io_schedule_timeout at c0633e0f
  #3 [c42b4da0] congestion_wait at c045d7ee
  #4 [c42b4dc8] try_to_free_pages at c045c82a
  #5 [c42b4e2c] __alloc_pages_internal at c04579fc
  #6 [c42b4e70] cache_alloc_refill at c0471235
  #7 [c42b4ec0] kmem_cache_alloc at c0470fa8
  #8 [c42b4ed4] alloc_buffer_head at c048c06b
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^-> It tries to get a buffer but
                                               cannot get one. Because memory
                                               collectors (include: process (1))
                                               cannot go farther.
  #9 [c42b4edc] journal_write_metadata_buffer at f8090eb6
#10 [c42b4f10] journal_commit_transaction at f808df80
#11 [c42b4f98] kjournald at f809089d
#12 [c42b4fd8] kthread at c043612c
#13 [c42b4fe4] kernel_thread_helper at c04045e1

journal structure: 0xccab1e00

Additional Information:
The process by which the trigger of a deadlock is pulled is not only kswapd.
[1]
PID: 1800   TASK: f7379b60  CPU: 1   COMMAND: "rsyslogd"
  #0 [f61c3bfc] schedule at c06346a3
  #1 [f61c3c74] log_wait_commit at f80904c1
  #2 [f61c3cb0] journal_try_to_free_buffers at f808c81f
  #3 [f61c3cd4] blkdev_releasepage at c04916cc
  #4 [f61c3ce4] try_to_release_page at c04526b1
  #5 [f61c3cf0] shrink_page_list at c045b3d1
  #6 [f61c3d90] shrink_list at c045b72e
  #7 [f61c3e30] shrink_zone at c045bbc6
  #8 [f61c3e80] try_to_free_pages at c045c787
  #9 [f61c3ee4] __alloc_pages_internal at c04579fc
#10 [f61c3f28] __get_free_pages at c0457bac
#11 [f61c3f30] copy_process at c0425823
#12 [f61c3f68] do_fork at c042674b
#13 [f61c3fa4] sys_clone at c0402399
#14 [f61c3fb4] system_call at c0403893
     EAX: ffffffda  EBX: 003d0f00  ECX: b7fcd4b4  EDX: b7fcdbd8
     DS:  007b      ESI: b6fcb16c  ES:  007b      EDI: b7fcdbd8
     SS:  007b      ESP: b6fcb100  EBP: b6fcb198
     CS:  0073      EIP: 00d271f8  ERR: 00000078  EFLAGS: 00000296

[2]
PID: 1990   TASK: f70c6000  CPU: 0   COMMAND: "pcscd"
  #0 [f6078be0] schedule at c06346a3
  #1 [f6078c58] log_wait_commit at f80904c1
  #2 [f6078c94] journal_try_to_free_buffers at f808c81f
  #3 [f6078cb8] blkdev_releasepage at c04916cc
  #4 [f6078cc8] try_to_release_page at c04526b1
  #5 [f6078cd4] shrink_page_list at c045b3d1
  #6 [f6078d74] shrink_list at c045b72e
  #7 [f6078e14] shrink_zone at c045bbc6
  #8 [f6078e64] try_to_free_pages at c045c787
  #9 [f6078ec8] __alloc_pages_internal at c04579fc
#10 [f6078f0c] cache_alloc_refill at c0471235
#11 [f6078f5c] kmem_cache_alloc at c0470fa8
#12 [f6078f70] getname at c047b71c
#13 [f6078f88] do_sys_open at c04729d2
#14 [f6078fa0] sys_open at c0472ab6
#15 [f6078fb4] ia32_sysenter_target at c04037da
     EAX: 00000005  EBX: 006a2700  ECX: 00098800  EDX: 00000000
     DS:  007b      ESI: 006a2700  ES:  007b      EDI: 00000000
     SS:  007b      ESP: b801d0f8  EBP: b801d188
     CS:  0073      EIP: b803f424  ERR: 00000005  EFLAGS: 00000202
...

Regards,
Toshiyuki Okajima


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/3] add releasepage hooks to block devices which can be used by file systems
  2009-01-06  4:07                   ` Toshiyuki Okajima
@ 2009-01-06  4:29                     ` Theodore Tso
  0 siblings, 0 replies; 25+ messages in thread
From: Theodore Tso @ 2009-01-06  4:29 UTC (permalink / raw)
  To: Toshiyuki Okajima; +Cc: Al Viro, Ext4 Developers List, linux-fsdevel

On Tue, Jan 06, 2009 at 01:07:27PM +0900, Toshiyuki Okajima wrote:
> > Sounds like a deadlock caused by the fact that we're no longer masking
> > __GFP_WAIT, probably on journal->j_wait_done_commit.  Presumably the
> > system came under pressure during a commit operation, which makes
> > sense, and so we ended up with a deadlock between kjournald and
> > kswapd.  The fix is pretty simple; we just need to mask out the
> > __GFP_WAIT in the filesystem-specific callback, since this is a
> > restriction imposed by the filesystem's use of the jbd/jbd2 layer.
>
> Your opinion is correct.
> A detailed investigation is done, and the root cause has been understood.

Thanks for confirming my supposition.  I've checked the following
modified patches into the ext4 patch queue, which will be sent to
Linus shortly.

						- Ted

ext3: provide function to release metadata pages under memory pressure

From: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>

Pages in the page cache belonging to ext3 data files are released via
the ext3_releasepage() function specified in the ext3 inode's
address_space_ops.  However, metadata blocks (such as indirect blocks,
directory blocks, etc) are managed via the block device
address_space_ops, and they can not be released by
try_to_free_buffers() if they have a journal head attached to them.

To address this, we supply a try_to_free_pages() function which calls
journal_try_to_free_buffers() function to free the metadata, and which
is called by the block device's blkdev_releasepage() function.

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-fsdevel@vger.kernel.org
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 541d5e4..6900ff0 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -682,6 +682,26 @@ static struct dentry *ext3_fh_to_parent(struct super_block *sb, struct fid *fid,
 				    ext3_nfs_get_inode);
 }
 
+/*
+ * Try to release metadata pages (indirect blocks, directories) which are
+ * mapped via the block device.  Since these pages could have journal heads
+ * which would prevent try_to_free_buffers() from freeing them, we must use
+ * jbd layer's try_to_free_buffers() function to release them.
+ */
+static int bdev_try_to_free_page(struct super_block *sb, struct page *page,
+				 gfp_t wait)
+{
+	journal_t *journal = EXT3_SB(sb)->s_journal;
+
+	WARN_ON(PageChecked(page));
+	if (!page_has_buffers(page))
+		return 0;
+	if (journal)
+		return journal_try_to_free_buffers(journal, page, 
+						   wait & ~__GFP_WAIT);
+	return try_to_free_buffers(page);
+}
+
 #ifdef CONFIG_QUOTA
 #define QTYPE2NAME(t) ((t)==USRQUOTA?"user":"group")
 #define QTYPE2MOPT(on, t) ((t)==USRQUOTA?((on)##USRJQUOTA):((on)##GRPJQUOTA))
@@ -746,6 +766,7 @@ static const struct super_operations ext3_sops = {
 	.quota_read	= ext3_quota_read,
 	.quota_write	= ext3_quota_write,
 #endif
+	.bdev_try_to_free_page = bdev_try_to_free_page,
 };
 
 static const struct export_operations ext3_export_ops = {

----------------------

ext4: provide function to release metadata pages under memory pressure

From: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>

Pages in the page cache belonging to ext4 data files are released via
the ext4_releasepage() function specified in the ext4 inode's
address_space_ops.  However, metadata blocks (such as indirect blocks,
directory blocks, etc) are managed via the block device
address_space_ops, and they can not be released by
try_to_free_buffers() if they have a journal head attached to them.

To address this, we supply a release_metadata function which calls
jbd2_journal_try_to_free_buffers() function to free the metadata, and
which is called by the block device's blkdev_releasepage() function.

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-fsdevel@vger.kernel.org
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 0023ca9..5cfac23 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -908,6 +908,25 @@ static struct dentry *ext4_fh_to_parent(struct super_block *sb, struct fid *fid,
 				    ext4_nfs_get_inode);
 }
 
+/*
+ * Try to release metadata pages (indirect blocks, directories) which are
+ * mapped via the block device.  Since these pages could have journal heads
+ * which would prevent try_to_free_buffers() from freeing them, we must use
+ * jbd2 layer's try_to_free_buffers() function to release them.
+ */
+static int bdev_try_to_free_page(struct super_block *sb, struct page *page, gfp_t wait)
+{
+	journal_t *journal = EXT4_SB(sb)->s_journal;
+
+	WARN_ON(PageChecked(page));
+	if (!page_has_buffers(page))
+		return 0;
+	if (journal)
+		return jbd2_journal_try_to_free_buffers(journal, page,
+							wait & ~__GFP_WAIT);
+	return try_to_free_buffers(page);
+}
+
 #ifdef CONFIG_QUOTA
 #define QTYPE2NAME(t) ((t) == USRQUOTA ? "user" : "group")
 #define QTYPE2MOPT(on, t) ((t) == USRQUOTA?((on)##USRJQUOTA):((on)##GRPJQUOTA))
@@ -972,6 +991,7 @@ static const struct super_operations ext4_sops = {
 	.quota_read	= ext4_quota_read,
 	.quota_write	= ext4_quota_write,
 #endif
+	.bdev_try_to_free_page = bdev_try_to_free_page,
 };
 
 static const struct export_operations ext4_export_ops = {

^ permalink raw reply related	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2009-01-06  4:29 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-12-02 11:06 [BUG][PATCH 1/4] ext3: fix a cause of __schedule_bug via blkdev_releasepage Toshiyuki Okajima
2008-12-08 14:01 ` Theodore Tso
2008-12-08 14:06   ` [PATCH -V2] ext3: provide function to release metadata pages under memory pressure Theodore Ts'o
2008-12-08 14:06     ` [PATCH -V2] ext4: " Theodore Ts'o
2008-12-12  0:54   ` [BUG][PATCH 1/4] ext3: fix a cause of __schedule_bug via blkdev_releasepage Toshiyuki Okajima
2008-12-12  6:21     ` Theodore Tso
2008-12-12 17:52       ` [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems Theodore Ts'o
2008-12-12 17:52         ` [PATCH -v3] ext3: provide function to release metadata pages under memory pressure Theodore Ts'o
2008-12-12 17:52           ` [PATCH -v3] ext4: " Theodore Ts'o
2008-12-17 15:39         ` [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems Jan Kara
2008-12-18  5:15           ` Toshiyuki Okajima
2008-12-18 13:12             ` Jan Kara
2008-12-18 14:54               ` Theodore Tso
2008-12-18 16:38                 ` Jan Kara
2008-12-19  5:15               ` Toshiyuki Okajima
2008-12-26  5:01         ` Al Viro
2009-01-03 15:09           ` Theodore Ts'o
2009-01-03 15:09             ` [PATCH 1/3] add releasepage " Theodore Ts'o
2009-01-03 15:09               ` [PATCH 2/3] ext3: provide function to release metadata pages under memory pressure Theodore Ts'o
2009-01-03 15:09                 ` [PATCH 3/3] ext4: " Theodore Ts'o
2009-01-05  8:16               ` [PATCH 1/3] add releasepage hooks to block devices which can be used by file systems Toshiyuki Okajima
2009-01-05 16:05                 ` Theodore Tso
2009-01-06  4:07                   ` Toshiyuki Okajima
2009-01-06  4:29                     ` Theodore Tso
2008-12-15  2:21       ` [BUG][PATCH 1/4] ext3: fix a cause of __schedule_bug via blkdev_releasepage Toshiyuki Okajima

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.