* [BUG][PATCH 1/4] ext3: fix a cause of __schedule_bug via blkdev_releasepage @ 2008-12-02 11:06 Toshiyuki Okajima 2008-12-08 14:01 ` Theodore Tso 0 siblings, 1 reply; 25+ messages in thread From: Toshiyuki Okajima @ 2008-12-02 11:06 UTC (permalink / raw) To: aneesh.kumar, balbir, tytso; +Cc: linux-ext4 ext3: fix a cause of __schedule_bug via blkdev_releasepage From: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> A cause of this problem is calling log_wait_commit() on journal_try_to_free_buffers() with a read-lock via blkdev_releasepage(). This logic is for uncommitted data buffers. And a read/write-lock is required for a client usage of blkdev_releasepage. By the way, we want to release only metadata buffers on ext3_release_metadata(). Because a page which binds to blkdev is used as metadata for ext3. Therefore we don't need to wait for a commit on journal_try_to_free_buffers() via ext3_release_matadata(). As a result, we add a journal_try_to_free_metadata_buffers() almost same as journal_try_to_free_buffers() except not calling log_wait_commit. So, we change journal_try_to_free_buffers() with journal_try_to_free_metadata_buffers(). This issue was reported by Aneesh Kumar K.V. http://marc.info/?l=linux-ext4&m=122814568309893&w=2 Reported-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: "Theodore Ts'o" <tytso@mit.edu> -- fs/ext3/inode.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff -Nurp linux-2.6.28-rc6/fs/ext3/inode.c linux-2.6.28-rc6.2/fs/ext3/inode.c --- linux-2.6.28-rc6/fs/ext3/inode.c 2008-12-02 09:32:27.000000000 +0900 +++ linux-2.6.28-rc6.2/fs/ext3/inode.c 2008-12-02 10:19:53.000000000 +0900 @@ -1696,7 +1696,7 @@ int ext3_release_metadata(void *client, BUG_ON(EXT3_SB(sb) == NULL); journal = EXT3_SB(sb)->s_journal; if (journal != NULL) - return journal_try_to_free_buffers(journal, page, wait); + return journal_try_to_free_metadata_buffers(journal, page, wait); else return try_to_free_buffers(page); } ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [BUG][PATCH 1/4] ext3: fix a cause of __schedule_bug via blkdev_releasepage 2008-12-02 11:06 [BUG][PATCH 1/4] ext3: fix a cause of __schedule_bug via blkdev_releasepage Toshiyuki Okajima @ 2008-12-08 14:01 ` Theodore Tso 2008-12-08 14:06 ` [PATCH -V2] ext3: provide function to release metadata pages under memory pressure Theodore Ts'o 2008-12-12 0:54 ` [BUG][PATCH 1/4] ext3: fix a cause of __schedule_bug via blkdev_releasepage Toshiyuki Okajima 0 siblings, 2 replies; 25+ messages in thread From: Theodore Tso @ 2008-12-08 14:01 UTC (permalink / raw) To: Toshiyuki Okajima; +Cc: aneesh.kumar, balbir, linux-ext4 Toshiyuki-san, My apologies for not having a chance to review your patches; things have been rather busy for me. There were a couple of shortcomings in your patch series, and I think there is a better way of solving the issue at hand. First of all, patches #1 and #2 use a new function which is not actually defined until patches #3 and #4, respectively. This can make it difficult for people who are trying to use "git bisect" to try to localize the root cause of a problem. Secondly, the introduction of a large number of wrapper functions increases stack utilization, and makes the call depth deeper (and in the long run, each increase in call depth makes the code that much harder to trace through and understand), and so it should be done only as last resort. Fortunately, there is a simpler way of fixing this problem. I include the inter-diff below, but I will fold this into the current blkdev_releasepage() patches that are in the ext4 patch queue. Best regards, - Ted P.S. Note that this patch is functionally identical to what you proposed in your patch series, but since the gfp_wait mask already controlls whether or not log_wait_commit() is called, instead of introducing a new functional parameter, we just mask off __GFP_WAIT before calling jbd2_journal_try_to_free_buffers. I'll make a similar change to the ext3 patch, and attach the two revised patches to this mail thread. diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index e77a059..543f3d0 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3049,6 +3049,12 @@ static int ext4_releasepage(struct page *page, gfp_t wait) * mapped via the block device. Since these pages could have journal heads * which would prevent try_to_free_buffers() from freeing them, we must use * jbd2 layer's try_to_free_buffers() function to release them. + * + * Note: we have to strip the __GFP_WAIT flag before calling + * jbd2_journal_try_to_free_buffers because blkdev_releasepage is + * called while holding a spinlock (bdev_inode.client_lock). + * Fortunately the metadata buffers we are interested are freed right + * away and do not require calling journal_wait_for_transaction_sync_data(). */ int ext4_release_metadata(void *client, struct page *page, gfp_t wait) { @@ -3061,7 +3067,8 @@ int ext4_release_metadata(void *client, struct page *page, gfp_t wait) BUG_ON(EXT4_SB(sb) == NULL); journal = EXT4_SB(sb)->s_journal; if (journal != NULL) - return jbd2_journal_try_to_free_buffers(journal, page, wait); + return jbd2_journal_try_to_free_buffers(journal, page, + wait & ~__GFP_WAIT); else return try_to_free_buffers(page); } ^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH -V2] ext3: provide function to release metadata pages under memory pressure 2008-12-08 14:01 ` Theodore Tso @ 2008-12-08 14:06 ` Theodore Ts'o 2008-12-08 14:06 ` [PATCH -V2] ext4: " Theodore Ts'o 2008-12-12 0:54 ` [BUG][PATCH 1/4] ext3: fix a cause of __schedule_bug via blkdev_releasepage Toshiyuki Okajima 1 sibling, 1 reply; 25+ messages in thread From: Theodore Ts'o @ 2008-12-08 14:06 UTC (permalink / raw) To: Ext4 Developers List Cc: aneesh.kumar, balbir, Toshiyuki Okajima, Theodore Ts'o, linux-fsdevel From: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Pages in the page cache belonging to ext3 data files are released via the ext3_releasepage() function specified in the ext3 inode's address_space_ops. However, metadata blocks (such as indirect blocks, directory blocks, etc) are managed via the block device address_space_ops, and they can not be released by try_to_free_buffers() if they have a journal head attached to them. To address this, we supply a release_metadata function which is called by the block device's blkdev_releasepage() function, which calls journal_try_to_free_buffers() function to free the metadata. Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: linux-fsdevel@vger.kernel.org --- fs/ext3/inode.c | 29 +++++++++++++++++++++++++++++ fs/ext3/super.c | 2 ++ include/linux/ext3_fs.h | 2 ++ 3 files changed, 33 insertions(+), 0 deletions(-) diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c index f8424ad..89c065a 100644 --- a/fs/ext3/inode.c +++ b/fs/ext3/inode.c @@ -1680,6 +1680,35 @@ static int ext3_releasepage(struct page *page, gfp_t wait) } /* + * Try to release metadata pages (indirect blocks, directories) which are + * mapped via the block device. Since these pages could have journal heads + * which would prevent try_to_free_buffers() from freeing them, we must use + * jbd layer's try_to_free_buffers() function to release them. + * + * Note: we have to strip the __GFP_WAIT flag before calling + * journal_try_to_free_buffers because blkdev_releasepage is + * called while holding a spinlock (bdev_inode.client_lock). + * Fortunately the metadata buffers we are interested are freed right + * away and do not require calling journal_wait_for_transaction_sync_data(). + */ +int ext3_release_metadata(void *client, struct page *page, gfp_t wait) +{ + struct super_block *sb = (struct super_block*)client; + journal_t *journal; + + WARN_ON(PageChecked(page)); + if (!page_has_buffers(page)) + return 0; + BUG_ON(EXT3_SB(sb) == NULL); + journal = EXT3_SB(sb)->s_journal; + if (journal != NULL) + return journal_try_to_free_buffers(journal, page, + wait & ~__GFP_WAIT); + else + return try_to_free_buffers(page); +} + +/* * If the O_DIRECT write will extend the file then add this inode to the * orphan list. So recovery will truncate it back to the original size * if the machine crashes during the write. diff --git a/fs/ext3/super.c b/fs/ext3/super.c index 541d5e4..4428149 100644 --- a/fs/ext3/super.c +++ b/fs/ext3/super.c @@ -2981,6 +2981,8 @@ static struct file_system_type ext3_fs_type = { .name = "ext3", .get_sb = ext3_get_sb, .kill_sb = kill_block_super, + .release_metadata + = ext3_release_metadata, .fs_flags = FS_REQUIRES_DEV, }; diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h index 9004794..3e3052f 100644 --- a/include/linux/ext3_fs.h +++ b/include/linux/ext3_fs.h @@ -867,6 +867,8 @@ extern void ext3_get_inode_flags(struct ext3_inode_info *); extern void ext3_set_aops(struct inode *inode); extern int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, u64 start, u64 len); +extern int ext3_release_metadata(void *client, struct page *page, + gfp_t wait); /* ioctl.c */ extern int ext3_ioctl (struct inode *, struct file *, unsigned int, -- 1.6.0.4.8.g36f27.dirty ^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH -V2] ext4: provide function to release metadata pages under memory pressure 2008-12-08 14:06 ` [PATCH -V2] ext3: provide function to release metadata pages under memory pressure Theodore Ts'o @ 2008-12-08 14:06 ` Theodore Ts'o 0 siblings, 0 replies; 25+ messages in thread From: Theodore Ts'o @ 2008-12-08 14:06 UTC (permalink / raw) To: Ext4 Developers List Cc: aneesh.kumar, balbir, Toshiyuki Okajima, Theodore Ts'o, linux-fsdevel From: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Pages in the page cache belonging to ext4 data files are released via the ext4_releasepage() function specified in the ext4 inode's address_space_ops. However, metadata blocks (such as indirect blocks, directory blocks, etc) are managed via the block device address_space_ops, and they can not be released by try_to_free_buffers() if they have a journal head attached to them. To address this, we supply a release_metadata function which is called by the block device's blkdev_releasepage() function, which calls journal_try_to_free_buffers() function to free the metadata. Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: linux-fsdevel@vger.kernel.org --- fs/ext4/ext4.h | 2 ++ fs/ext4/inode.c | 29 +++++++++++++++++++++++++++++ fs/ext4/super.c | 4 ++++ 3 files changed, 35 insertions(+), 0 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 695b45c..91e06e4 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1099,6 +1099,8 @@ extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks); extern int ext4_block_truncate_page(handle_t *handle, struct address_space *mapping, loff_t from); extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page); +extern int ext4_release_metadata(void *client, struct page *page, + gfp_t wait); /* ioctl.c */ extern long ext4_ioctl(struct file *, unsigned int, unsigned long); diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index f6d9447..1647903 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3045,6 +3045,35 @@ static int ext4_releasepage(struct page *page, gfp_t wait) } /* + * Try to release metadata pages (indirect blocks, directories) which are + * mapped via the block device. Since these pages could have journal heads + * which would prevent try_to_free_buffers() from freeing them, we must use + * jbd2 layer's try_to_free_buffers() function to release them. + * + * Note: we have to strip the __GFP_WAIT flag before calling + * jbd2_journal_try_to_free_buffers because blkdev_releasepage is + * called while holding a spinlock (bdev_inode.client_lock). + * Fortunately the metadata buffers we are interested are freed right + * away and do not require calling journal_wait_for_transaction_sync_data(). + */ +int ext4_release_metadata(void *client, struct page *page, gfp_t wait) +{ + struct super_block *sb = (struct super_block*)client; + journal_t *journal; + + WARN_ON(PageChecked(page)); + if (!page_has_buffers(page)) + return 0; + BUG_ON(EXT4_SB(sb) == NULL); + journal = EXT4_SB(sb)->s_journal; + if (journal != NULL) + return jbd2_journal_try_to_free_buffers(journal, page, + wait & ~__GFP_WAIT); + else + return try_to_free_buffers(page); +} + +/* * If the O_DIRECT write will extend the file then add this inode to the * orphan list. So recovery will truncate it back to the original size * if the machine crashes during the write. diff --git a/fs/ext4/super.c b/fs/ext4/super.c index bd41fad..f447c46 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -3689,6 +3689,8 @@ static struct file_system_type ext4_fs_type = { .name = "ext4", .get_sb = ext4_get_sb, .kill_sb = kill_block_super, + .release_metadata + = ext4_release_metadata, .fs_flags = FS_REQUIRES_DEV, }; @@ -3708,6 +3710,8 @@ static struct file_system_type ext4dev_fs_type = { .name = "ext4dev", .get_sb = ext4dev_get_sb, .kill_sb = kill_block_super, + .release_metadata + = ext4_release_metadata, .fs_flags = FS_REQUIRES_DEV, }; MODULE_ALIAS("ext4dev"); -- 1.6.0.4.8.g36f27.dirty ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [BUG][PATCH 1/4] ext3: fix a cause of __schedule_bug via blkdev_releasepage 2008-12-08 14:01 ` Theodore Tso 2008-12-08 14:06 ` [PATCH -V2] ext3: provide function to release metadata pages under memory pressure Theodore Ts'o @ 2008-12-12 0:54 ` Toshiyuki Okajima 2008-12-12 6:21 ` Theodore Tso 1 sibling, 1 reply; 25+ messages in thread From: Toshiyuki Okajima @ 2008-12-12 0:54 UTC (permalink / raw) To: Theodore Tso; +Cc: aneesh.kumar, balbir, linux-ext4 Ted-san, Thank you for your reviewing. > Toshiyuki-san, > > My apologies for not having a chance to review your patches; things > have been rather busy for me. There were a couple of shortcomings in > your patch series, and I think there is a better way of solving the > issue at hand. First of all, patches #1 and #2 use a new function > which is not actually defined until patches #3 and #4, respectively. > This can make it difficult for people who are trying to use "git > bisect" to try to localize the root cause of a problem. > > Secondly, the introduction of a large number of wrapper functions > increases stack utilization, and makes the call depth deeper (and in > the long run, each increase in call depth makes the code that much > harder to trace through and understand), and so it should be done only > as last resort. Fortunately, there is a simpler way of fixing this > problem. I include the inter-diff below, but I will fold this into > the current blkdev_releasepage() patches that are in the ext4 patch > queue. > > Best regards, > > - Ted > > P.S. Note that this patch is functionally identical to what you > proposed in your patch series, but since the gfp_wait mask already > controlls whether or not log_wait_commit() is called, instead of > introducing a new functional parameter, we just mask off __GFP_WAIT > before calling jbd2_journal_try_to_free_buffers. I'll make a similar > change to the ext3 patch, and attach the two revised patches to this > mail thread. Through the idea as follows, I agree to your change. ----------------------------------------------------------------------------- To tell the truth, at first, I imagined the same patch as yours to fix this problem. But I have made another patch because I thought that ext3(or ext4) should not know the contents of the processing of journal_try_to_free_buffers in detail. (ext3 should not know there is a possibility to call journal_wait_for_transaction_sync_data from journal_try_to_free_buffers.) So, I have made a new function, journal_try_to_free_metadata_buffers to release only metadata buffer_heads. (I wanted a function which is the almost same as journal_try_to_free_buffers except calling journal_wait_for_transaction_sync_data from it.) However, this new function needed big changes, you know. I reconsidered what was the most suitable patch to fix this problem after I read your mail (patch). And then, I thought that it was important to make the most concise patch to fix only a root cause. Big patch is not easy to understand even if it is more logical one. Therefore, there is the fact that ext3_release_metadata must not sleep because it can get a spinlock, and then, only changing ext3_release_metadata to the logic to make it not sleep is the simplest fix for this problem. ----------------------------------------------------------------------------- Best Regards, Toshiyuki Okajima ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [BUG][PATCH 1/4] ext3: fix a cause of __schedule_bug via blkdev_releasepage 2008-12-12 0:54 ` [BUG][PATCH 1/4] ext3: fix a cause of __schedule_bug via blkdev_releasepage Toshiyuki Okajima @ 2008-12-12 6:21 ` Theodore Tso 2008-12-12 17:52 ` [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems Theodore Ts'o 2008-12-15 2:21 ` [BUG][PATCH 1/4] ext3: fix a cause of __schedule_bug via blkdev_releasepage Toshiyuki Okajima 0 siblings, 2 replies; 25+ messages in thread From: Theodore Tso @ 2008-12-12 6:21 UTC (permalink / raw) To: Toshiyuki Okajima; +Cc: aneesh.kumar, balbir, linux-ext4 On Fri, Dec 12, 2008 at 09:54:18AM +0900, Toshiyuki Okajima wrote: > To tell the truth, at first, I imagined the same patch as yours to fix this > problem. But I have made another patch because I thought that ext3(or ext4) > should not know the contents of the processing of journal_try_to_free_buffers > in detail. (ext3 should not know there is a possibility to call > journal_wait_for_transaction_sync_data from journal_try_to_free_buffers.) I agree, but ext3 doesn't need to know that. What my change did was to mask off the _GFP_WAIT flag, which prohibits the function it calls from blocking, because it knows that its caller is holding a spinlock. And actually, come to think of it. We can do even better; the right fix is to have blkdev_releasepage() mask off the _GFP_WAIT flag; this is the function which is taking spinlock, and by masking off the __GFP_WAIT flag, this is simply requesting all of the downstream functions not to block, but to do the best job they can do without blocking. It doesn't need to know whether it's going to call log_wait_commit(), or anything else; all it needs to do is request "please don't block". That means we only make the request once, in the function which is taking spinlock, so all of the per-filesystem implementations of release_metadata() don't need to know that its caller is holding a spinlock. - Ted ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems 2008-12-12 6:21 ` Theodore Tso @ 2008-12-12 17:52 ` Theodore Ts'o 2008-12-12 17:52 ` [PATCH -v3] ext3: provide function to release metadata pages under memory pressure Theodore Ts'o ` (2 more replies) 2008-12-15 2:21 ` [BUG][PATCH 1/4] ext3: fix a cause of __schedule_bug via blkdev_releasepage Toshiyuki Okajima 1 sibling, 3 replies; 25+ messages in thread From: Theodore Ts'o @ 2008-12-12 17:52 UTC (permalink / raw) To: Ext4 Developers List; +Cc: Toshiyuki Okajima, Theodore Ts'o, linux-fsdevel From: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Implement blkdev_releasepage() to release the buffer_heads and page after we release private data which belongs to a client of the block device, such as a filesystem. blkdev_releasepage() call the client's releasepage() which is registered by blkdev_register_client_releasepage() to release its private data. Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: linux-fsdevel@vger.kernel.org --- fs/block_dev.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++++ fs/super.c | 22 ++++++++++++++++ include/linux/fs.h | 9 +++++++ 3 files changed, 99 insertions(+), 0 deletions(-) diff --git a/fs/block_dev.c b/fs/block_dev.c index db831ef..bac0a38 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -29,6 +29,9 @@ struct bdev_inode { struct block_device bdev; + void *client; + int (*client_releasepage)(void*, struct page*, gfp_t); + rwlock_t client_lock; struct inode vfs_inode; }; @@ -260,6 +263,9 @@ static struct inode *bdev_alloc_inode(struct super_block *sb) struct bdev_inode *ei = kmem_cache_alloc(bdev_cachep, GFP_KERNEL); if (!ei) return NULL; + ei->client = NULL; + ei->client_releasepage = NULL; + rwlock_init(&ei->client_lock); return &ei->vfs_inode; } @@ -1208,6 +1214,67 @@ static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg) return blkdev_ioctl(bdev, mode, cmd, arg); } +/* + * blkdev_releasepage: execute ei->client_releasepage() if it exists. + * Otherwise, execute try_to_free_buffers(). + * ei->client_releasepage() releases private client's page if possible. + * Because a buffer_head's using counter is bigger than 0 if a client has + * a page for private usage. If so, try_to_free_buffers() cannot release it. + * Therefore a client must try to release a page itself. + */ +static int blkdev_releasepage(struct page *page, gfp_t wait) +{ + struct bdev_inode *ei = BDEV_I(page->mapping->host); + int ret; + + read_lock(&ei->client_lock); + if (ei->client_releasepage != NULL) + /* + * Since we are holding a spinlock (ei->client_lock), + * make sure the client_releasepage function + * understands that it must not block. + */ + ret = (*ei->client_releasepage)(ei->client, page, + wait & ~__GFP_WAIT); + else + ret = try_to_free_buffers(page); + read_unlock(&ei->client_lock); + return ret; +} + +/* + * blkdev_register_client_releasepage: register client_releasepage. + */ +int blkdev_register_client_releasepage(struct block_device *bdev, + void *client, int (*releasepage)(void*, struct page*, gfp_t)) +{ + struct bdev_inode *ei = BDEV_I(bdev->bd_inode); + int ret = 1; + + write_lock(&ei->client_lock); + if (ei->client == NULL && ei->client_releasepage == NULL) { + ei->client = client; + ei->client_releasepage = releasepage; + } else if (ei->client != client + || ei->client_releasepage != releasepage) + ret = 0; + write_unlock(&ei->client_lock); + return ret; +} + +/* + * blkdev_unregister_client_releasepage: unregister client_releasepage. + */ +void blkdev_unregister_client_releasepage(struct block_device *bdev) +{ + struct bdev_inode *ei = BDEV_I(bdev->bd_inode); + + write_lock(&ei->client_lock); + ei->client = NULL; + ei->client_releasepage = NULL; + write_unlock(&ei->client_lock); +} + static const struct address_space_operations def_blk_aops = { .readpage = blkdev_readpage, .writepage = blkdev_writepage, @@ -1215,6 +1282,7 @@ static const struct address_space_operations def_blk_aops = { .write_begin = blkdev_write_begin, .write_end = blkdev_write_end, .writepages = generic_writepages, + .releasepage = blkdev_releasepage, .direct_IO = blkdev_direct_IO, }; diff --git a/fs/super.c b/fs/super.c index 400a760..fd254eb 100644 --- a/fs/super.c +++ b/fs/super.c @@ -801,6 +801,18 @@ int get_sb_bdev(struct file_system_type *fs_type, s->s_flags |= MS_ACTIVE; } + /* + * register a client function which releases a page whose mapping is + * block device + */ + if (fs_type->release_metadata != NULL + && !blkdev_register_client_releasepage(bdev, s, + fs_type->release_metadata)) { + up_write(&s->s_umount); + deactivate_super(s); + error = -EBUSY; + goto error_bdev; + } return simple_set_mnt(mnt, s); @@ -819,6 +831,16 @@ void kill_block_super(struct super_block *sb) struct block_device *bdev = sb->s_bdev; fmode_t mode = sb->s_mode; + /* + * unregister a client function which releases a page whose mapping is + * block device + * + * This is sure to be unmounting here, and it releases all own data + * itself. Therefore the filesystem's function which is owned by the + * block device, which releases its data is not needed any more. + */ + if (sb->s_type->release_metadata != NULL) + blkdev_unregister_client_releasepage(bdev); generic_shutdown_super(sb); sync_blockdev(bdev); close_bdev_exclusive(bdev, mode); diff --git a/include/linux/fs.h b/include/linux/fs.h index 0dcdd94..398c8ed 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1538,6 +1538,7 @@ struct file_system_type { int (*get_sb) (struct file_system_type *, int, const char *, void *, struct vfsmount *); void (*kill_sb) (struct super_block *); + int (*release_metadata)(void*, struct page*, gfp_t); struct module *owner; struct file_system_type * next; struct list_head fs_supers; @@ -1699,8 +1700,16 @@ extern void bd_set_size(struct block_device *, loff_t size); extern void bd_forget(struct inode *inode); extern void bdput(struct block_device *); extern struct block_device *open_by_devnum(dev_t, fmode_t); +extern int blkdev_register_client_releasepage(struct block_device *, + void *, int (*releasepage)(void *, struct page*, gfp_t)); +extern void blkdev_unregister_client_releasepage(struct block_device *); #else static inline void bd_forget(struct inode *inode) {} +static inline int blkdev_register_client_releasepage(struct block_device *, + void *, int (*releasepage)(void *, struct page*, gfp_t)) +{ return 1; } +static inline void blkdev_unregister_client_releasepage(struct block_device *) +{} #endif extern const struct file_operations def_blk_fops; extern const struct file_operations def_chr_fops; -- 1.6.0.4.8.g36f27.dirty ^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH -v3] ext3: provide function to release metadata pages under memory pressure 2008-12-12 17:52 ` [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems Theodore Ts'o @ 2008-12-12 17:52 ` Theodore Ts'o 2008-12-12 17:52 ` [PATCH -v3] ext4: " Theodore Ts'o 2008-12-17 15:39 ` [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems Jan Kara 2008-12-26 5:01 ` Al Viro 2 siblings, 1 reply; 25+ messages in thread From: Theodore Ts'o @ 2008-12-12 17:52 UTC (permalink / raw) To: Ext4 Developers List; +Cc: Toshiyuki Okajima, Theodore Ts'o, linux-fsdevel From: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Pages in the page cache belonging to ext3 data files are released via the ext3_releasepage() function specified in the ext3 inode's address_space_ops. However, metadata blocks (such as indirect blocks, directory blocks, etc) are managed via the block device address_space_ops, and they can not be released by try_to_free_buffers() if they have a journal head attached to them. To address this, we supply a release_metadata function which is called by the block device's blkdev_releasepage() function, which calls journal_try_to_free_buffers() function to free the metadata. Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: linux-fsdevel@vger.kernel.org --- fs/ext3/inode.c | 22 ++++++++++++++++++++++ fs/ext3/super.c | 2 ++ include/linux/ext3_fs.h | 2 ++ 3 files changed, 26 insertions(+), 0 deletions(-) diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c index f8424ad..b1fecda 100644 --- a/fs/ext3/inode.c +++ b/fs/ext3/inode.c @@ -1680,6 +1680,28 @@ static int ext3_releasepage(struct page *page, gfp_t wait) } /* + * Try to release metadata pages (indirect blocks, directories) which are + * mapped via the block device. Since these pages could have journal heads + * which would prevent try_to_free_buffers() from freeing them, we must use + * jbd layer's try_to_free_buffers() function to release them. + */ +int ext3_release_metadata(void *client, struct page *page, gfp_t wait) +{ + struct super_block *sb = (struct super_block*)client; + journal_t *journal; + + WARN_ON(PageChecked(page)); + if (!page_has_buffers(page)) + return 0; + BUG_ON(EXT3_SB(sb) == NULL); + journal = EXT3_SB(sb)->s_journal; + if (journal != NULL) + return journal_try_to_free_buffers(journal, page, wait); + else + return try_to_free_buffers(page); +} + +/* * If the O_DIRECT write will extend the file then add this inode to the * orphan list. So recovery will truncate it back to the original size * if the machine crashes during the write. diff --git a/fs/ext3/super.c b/fs/ext3/super.c index 541d5e4..4428149 100644 --- a/fs/ext3/super.c +++ b/fs/ext3/super.c @@ -2981,6 +2981,8 @@ static struct file_system_type ext3_fs_type = { .name = "ext3", .get_sb = ext3_get_sb, .kill_sb = kill_block_super, + .release_metadata + = ext3_release_metadata, .fs_flags = FS_REQUIRES_DEV, }; diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h index 9004794..3e3052f 100644 --- a/include/linux/ext3_fs.h +++ b/include/linux/ext3_fs.h @@ -867,6 +867,8 @@ extern void ext3_get_inode_flags(struct ext3_inode_info *); extern void ext3_set_aops(struct inode *inode); extern int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, u64 start, u64 len); +extern int ext3_release_metadata(void *client, struct page *page, + gfp_t wait); /* ioctl.c */ extern int ext3_ioctl (struct inode *, struct file *, unsigned int, -- 1.6.0.4.8.g36f27.dirty ^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH -v3] ext4: provide function to release metadata pages under memory pressure 2008-12-12 17:52 ` [PATCH -v3] ext3: provide function to release metadata pages under memory pressure Theodore Ts'o @ 2008-12-12 17:52 ` Theodore Ts'o 0 siblings, 0 replies; 25+ messages in thread From: Theodore Ts'o @ 2008-12-12 17:52 UTC (permalink / raw) To: Ext4 Developers List; +Cc: Toshiyuki Okajima, Theodore Ts'o, linux-fsdevel From: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Pages in the page cache belonging to ext4 data files are released via the ext4_releasepage() function specified in the ext4 inode's address_space_ops. However, metadata blocks (such as indirect blocks, directory blocks, etc) are managed via the block device address_space_ops, and they can not be released by try_to_free_buffers() if they have a journal head attached to them. To address this, we supply a release_metadata function which is called by the block device's blkdev_releasepage() function, which calls journal_try_to_free_buffers() function to free the metadata. Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: linux-fsdevel@vger.kernel.org --- fs/ext4/ext4.h | 2 ++ fs/ext4/inode.c | 22 ++++++++++++++++++++++ fs/ext4/super.c | 4 ++++ 3 files changed, 28 insertions(+), 0 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 695b45c..91e06e4 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1099,6 +1099,8 @@ extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks); extern int ext4_block_truncate_page(handle_t *handle, struct address_space *mapping, loff_t from); extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page); +extern int ext4_release_metadata(void *client, struct page *page, + gfp_t wait); /* ioctl.c */ extern long ext4_ioctl(struct file *, unsigned int, unsigned long); diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index f6d9447..e77a059 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3045,6 +3045,28 @@ static int ext4_releasepage(struct page *page, gfp_t wait) } /* + * Try to release metadata pages (indirect blocks, directories) which are + * mapped via the block device. Since these pages could have journal heads + * which would prevent try_to_free_buffers() from freeing them, we must use + * jbd2 layer's try_to_free_buffers() function to release them. + */ +int ext4_release_metadata(void *client, struct page *page, gfp_t wait) +{ + struct super_block *sb = (struct super_block*)client; + journal_t *journal; + + WARN_ON(PageChecked(page)); + if (!page_has_buffers(page)) + return 0; + BUG_ON(EXT4_SB(sb) == NULL); + journal = EXT4_SB(sb)->s_journal; + if (journal != NULL) + return jbd2_journal_try_to_free_buffers(journal, page, wait); + else + return try_to_free_buffers(page); +} + +/* * If the O_DIRECT write will extend the file then add this inode to the * orphan list. So recovery will truncate it back to the original size * if the machine crashes during the write. diff --git a/fs/ext4/super.c b/fs/ext4/super.c index bd41fad..f447c46 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -3689,6 +3689,8 @@ static struct file_system_type ext4_fs_type = { .name = "ext4", .get_sb = ext4_get_sb, .kill_sb = kill_block_super, + .release_metadata + = ext4_release_metadata, .fs_flags = FS_REQUIRES_DEV, }; @@ -3708,6 +3710,8 @@ static struct file_system_type ext4dev_fs_type = { .name = "ext4dev", .get_sb = ext4dev_get_sb, .kill_sb = kill_block_super, + .release_metadata + = ext4_release_metadata, .fs_flags = FS_REQUIRES_DEV, }; MODULE_ALIAS("ext4dev"); -- 1.6.0.4.8.g36f27.dirty ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems 2008-12-12 17:52 ` [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems Theodore Ts'o 2008-12-12 17:52 ` [PATCH -v3] ext3: provide function to release metadata pages under memory pressure Theodore Ts'o @ 2008-12-17 15:39 ` Jan Kara 2008-12-18 5:15 ` Toshiyuki Okajima 2008-12-26 5:01 ` Al Viro 2 siblings, 1 reply; 25+ messages in thread From: Jan Kara @ 2008-12-17 15:39 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Ext4 Developers List, Toshiyuki Okajima, linux-fsdevel Hello, > From: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> > > Implement blkdev_releasepage() to release the buffer_heads and page > after we release private data which belongs to a client of the block > device, such as a filesystem. > > blkdev_releasepage() call the client's releasepage() which is > registered by blkdev_register_client_releasepage() to release its > private data. Yes, this is IMO the right fix. I'm just wondering about the fact that we can't block in the client_releasepage(). That seems to be caused by the fact that we need to be protected against client_releasepage() callback changes which essentially means umount, right? I'm not saying I have a better solution but introducing such limitation seems stupid just because of umount... Honza > Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> > Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> > Cc: linux-fsdevel@vger.kernel.org > --- > fs/block_dev.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++++ > fs/super.c | 22 ++++++++++++++++ > include/linux/fs.h | 9 +++++++ > 3 files changed, 99 insertions(+), 0 deletions(-) > > diff --git a/fs/block_dev.c b/fs/block_dev.c > index db831ef..bac0a38 100644 > --- a/fs/block_dev.c > +++ b/fs/block_dev.c > @@ -29,6 +29,9 @@ > > struct bdev_inode { > struct block_device bdev; > + void *client; > + int (*client_releasepage)(void*, struct page*, gfp_t); > + rwlock_t client_lock; > struct inode vfs_inode; > }; > > @@ -260,6 +263,9 @@ static struct inode *bdev_alloc_inode(struct super_block *sb) > struct bdev_inode *ei = kmem_cache_alloc(bdev_cachep, GFP_KERNEL); > if (!ei) > return NULL; > + ei->client = NULL; > + ei->client_releasepage = NULL; > + rwlock_init(&ei->client_lock); > return &ei->vfs_inode; > } > > @@ -1208,6 +1214,67 @@ static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg) > return blkdev_ioctl(bdev, mode, cmd, arg); > } > > +/* > + * blkdev_releasepage: execute ei->client_releasepage() if it exists. > + * Otherwise, execute try_to_free_buffers(). > + * ei->client_releasepage() releases private client's page if possible. > + * Because a buffer_head's using counter is bigger than 0 if a client has > + * a page for private usage. If so, try_to_free_buffers() cannot release it. > + * Therefore a client must try to release a page itself. > + */ > +static int blkdev_releasepage(struct page *page, gfp_t wait) > +{ > + struct bdev_inode *ei = BDEV_I(page->mapping->host); > + int ret; > + > + read_lock(&ei->client_lock); > + if (ei->client_releasepage != NULL) > + /* > + * Since we are holding a spinlock (ei->client_lock), > + * make sure the client_releasepage function > + * understands that it must not block. > + */ > + ret = (*ei->client_releasepage)(ei->client, page, > + wait & ~__GFP_WAIT); > + else > + ret = try_to_free_buffers(page); > + read_unlock(&ei->client_lock); > + return ret; > +} > + > +/* > + * blkdev_register_client_releasepage: register client_releasepage. > + */ > +int blkdev_register_client_releasepage(struct block_device *bdev, > + void *client, int (*releasepage)(void*, struct page*, gfp_t)) > +{ > + struct bdev_inode *ei = BDEV_I(bdev->bd_inode); > + int ret = 1; > + > + write_lock(&ei->client_lock); > + if (ei->client == NULL && ei->client_releasepage == NULL) { > + ei->client = client; > + ei->client_releasepage = releasepage; > + } else if (ei->client != client > + || ei->client_releasepage != releasepage) > + ret = 0; > + write_unlock(&ei->client_lock); > + return ret; > +} > + > +/* > + * blkdev_unregister_client_releasepage: unregister client_releasepage. > + */ > +void blkdev_unregister_client_releasepage(struct block_device *bdev) > +{ > + struct bdev_inode *ei = BDEV_I(bdev->bd_inode); > + > + write_lock(&ei->client_lock); > + ei->client = NULL; > + ei->client_releasepage = NULL; > + write_unlock(&ei->client_lock); > +} > + > static const struct address_space_operations def_blk_aops = { > .readpage = blkdev_readpage, > .writepage = blkdev_writepage, > @@ -1215,6 +1282,7 @@ static const struct address_space_operations def_blk_aops = { > .write_begin = blkdev_write_begin, > .write_end = blkdev_write_end, > .writepages = generic_writepages, > + .releasepage = blkdev_releasepage, > .direct_IO = blkdev_direct_IO, > }; > > diff --git a/fs/super.c b/fs/super.c > index 400a760..fd254eb 100644 > --- a/fs/super.c > +++ b/fs/super.c > @@ -801,6 +801,18 @@ int get_sb_bdev(struct file_system_type *fs_type, > > s->s_flags |= MS_ACTIVE; > } > + /* > + * register a client function which releases a page whose mapping is > + * block device > + */ > + if (fs_type->release_metadata != NULL > + && !blkdev_register_client_releasepage(bdev, s, > + fs_type->release_metadata)) { > + up_write(&s->s_umount); > + deactivate_super(s); > + error = -EBUSY; > + goto error_bdev; > + } > > return simple_set_mnt(mnt, s); > > @@ -819,6 +831,16 @@ void kill_block_super(struct super_block *sb) > struct block_device *bdev = sb->s_bdev; > fmode_t mode = sb->s_mode; > > + /* > + * unregister a client function which releases a page whose mapping is > + * block device > + * > + * This is sure to be unmounting here, and it releases all own data > + * itself. Therefore the filesystem's function which is owned by the > + * block device, which releases its data is not needed any more. > + */ > + if (sb->s_type->release_metadata != NULL) > + blkdev_unregister_client_releasepage(bdev); > generic_shutdown_super(sb); > sync_blockdev(bdev); > close_bdev_exclusive(bdev, mode); > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 0dcdd94..398c8ed 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -1538,6 +1538,7 @@ struct file_system_type { > int (*get_sb) (struct file_system_type *, int, > const char *, void *, struct vfsmount *); > void (*kill_sb) (struct super_block *); > + int (*release_metadata)(void*, struct page*, gfp_t); > struct module *owner; > struct file_system_type * next; > struct list_head fs_supers; > @@ -1699,8 +1700,16 @@ extern void bd_set_size(struct block_device *, loff_t size); > extern void bd_forget(struct inode *inode); > extern void bdput(struct block_device *); > extern struct block_device *open_by_devnum(dev_t, fmode_t); > +extern int blkdev_register_client_releasepage(struct block_device *, > + void *, int (*releasepage)(void *, struct page*, gfp_t)); > +extern void blkdev_unregister_client_releasepage(struct block_device *); > #else > static inline void bd_forget(struct inode *inode) {} > +static inline int blkdev_register_client_releasepage(struct block_device *, > + void *, int (*releasepage)(void *, struct page*, gfp_t)) > +{ return 1; } > +static inline void blkdev_unregister_client_releasepage(struct block_device *) > +{} > #endif > extern const struct file_operations def_blk_fops; > extern const struct file_operations def_chr_fops; > -- > 1.6.0.4.8.g36f27.dirty > > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jan Kara <jack@suse.cz> SuSE CR Labs ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems 2008-12-17 15:39 ` [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems Jan Kara @ 2008-12-18 5:15 ` Toshiyuki Okajima 2008-12-18 13:12 ` Jan Kara 0 siblings, 1 reply; 25+ messages in thread From: Toshiyuki Okajima @ 2008-12-18 5:15 UTC (permalink / raw) To: Jan Kara; +Cc: Theodore Ts'o, Ext4 Developers List, linux-fsdevel Hi, > Hello, > > > > From: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> > > > > > > Implement blkdev_releasepage() to release the buffer_heads and page > > > after we release private data which belongs to a client of the block > > > device, such as a filesystem. > > > > > > blkdev_releasepage() call the client's releasepage() which is > > > registered by blkdev_register_client_releasepage() to release its > > > private data. > Yes, this is IMO the right fix. I'm just wondering about the fact that we > can't block in the client_releasepage(). That seems to be caused by the fact > that we need to be protected against client_releasepage() callback changes > which essentially means umount, right? I'm not saying I have a better solution > but introducing such limitation seems stupid just because of umount... > > Honza > Difference between v2 and v3 in blkdev_releasepage: < ret = (*ei->client_releasepage)(ei->client, page, wait); < else -- > /* > * Since we are holding a spinlock (ei->client_lock), > * make sure the client_releasepage function > * understands that it must not block. > */ > ret = (*ei->client_releasepage)(ei->client, page, > wait & ~__GFP_WAIT); > else Ask for clarification. Which of the following do you mean: 1) If using a spinlock in client_releasepage() is only for mount/umount, this implementation is not wise. 2) There is the fact that a spinlock is necessary for blkdev_releasepage(). This fact prevents us from making various implementations of client_releasepage(). (Without a spinlock, we can implement a client_releasepage() which can release the buffers with a sleep. As a result, it may enable more buffers release than before.) There is the fact that a filesystem can be mounted on several places, and the lock mechanism is absolutely necessary for this fact. I also think we are sad that we cannot implement various implementations for client_releasepage(). But now I cannot imagine what to do for a client_releasepage() which can sleep, too... Regards, Toshiyuki Okajima ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems 2008-12-18 5:15 ` Toshiyuki Okajima @ 2008-12-18 13:12 ` Jan Kara 2008-12-18 14:54 ` Theodore Tso 2008-12-19 5:15 ` Toshiyuki Okajima 0 siblings, 2 replies; 25+ messages in thread From: Jan Kara @ 2008-12-18 13:12 UTC (permalink / raw) To: Toshiyuki Okajima; +Cc: Theodore Ts'o, Ext4 Developers List, linux-fsdevel Hello, On Thu 18-12-08 14:15:25, Toshiyuki Okajima wrote: > > > > From: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> > > > > > > > > Implement blkdev_releasepage() to release the buffer_heads and page > > > > after we release private data which belongs to a client of the block > > > > device, such as a filesystem. > > > > > > > > blkdev_releasepage() call the client's releasepage() which is > > > > registered by blkdev_register_client_releasepage() to release its > > > > private data. > > Yes, this is IMO the right fix. I'm just wondering about the fact that we > > can't block in the client_releasepage(). That seems to be caused by the fact > > that we need to be protected against client_releasepage() callback changes > > which essentially means umount, right? I'm not saying I have a better solution > > but introducing such limitation seems stupid just because of umount... > > > Difference between v2 and v3 in blkdev_releasepage: > < ret = (*ei->client_releasepage)(ei->client, page, wait); > < else > -- > > /* > > * Since we are holding a spinlock (ei->client_lock), > > * make sure the client_releasepage function > > * understands that it must not block. > > */ > > ret = (*ei->client_releasepage)(ei->client, page, > > wait & ~__GFP_WAIT); > > else > > Ask for clarification. Yes, my question was more about the original design of the patch than about the particular fix. Sorry for the confusion. > Which of the following do you mean: > 1) If using a spinlock in client_releasepage() is only for mount/umount, > this implementation is not wise. > 2) There is the fact that a spinlock is necessary for blkdev_releasepage(). > This fact prevents us from making various implementations of > client_releasepage(). > (Without a spinlock, we can implement a client_releasepage() which can release > the buffers with a sleep. As a result, it may enable more buffers release than > before.) > > There is the fact that a filesystem can be mounted on several places, > and the lock mechanism is absolutely necessary for this fact. This is the thing I was wondering about. Why exactly is the spinlock necessary for blkdev_releasepage()? I understand we have to protect reading client_releasepage() pointer because it could change but my point was that it changes only during mount / umount. > I also think we are sad that we cannot implement various implementations for > client_releasepage(). But now I cannot imagine what to do for > a client_releasepage() which can sleep, too... Regards Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems 2008-12-18 13:12 ` Jan Kara @ 2008-12-18 14:54 ` Theodore Tso 2008-12-18 16:38 ` Jan Kara 2008-12-19 5:15 ` Toshiyuki Okajima 1 sibling, 1 reply; 25+ messages in thread From: Theodore Tso @ 2008-12-18 14:54 UTC (permalink / raw) To: Jan Kara; +Cc: Toshiyuki Okajima, Ext4 Developers List, linux-fsdevel On Thu, Dec 18, 2008 at 02:12:34PM +0100, Jan Kara wrote: > This is the thing I was wondering about. Why exactly is the spinlock > necessary for blkdev_releasepage()? I understand we have to protect > reading client_releasepage() pointer because it could change but my point > was that it changes only during mount / umount. Hmm.... I suppose we could use RCU, but then we'd have to worry about the race condition where client_releasepage() gets called after the umount has happened. > > I also think we are sad that we cannot implement various > > implementations for client_releasepage(). But now I cannot imagine > > what to do for a client_releasepage() which can sleep, too... My suggestion is that we not worry about making changes to fs/block_dev.c to allow client_releasepage() to sleep until we have filesystems that really need client_releasepage() to sleep. It probably is possible, with appropriate atomic bit sets for flags to indicate an unmount in progress, and client_releasepage in progress, and use of RCU, we could allow client_releasepage. But it might not be worth it unless there is a filesystem that really needs it. - Ted ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems 2008-12-18 14:54 ` Theodore Tso @ 2008-12-18 16:38 ` Jan Kara 0 siblings, 0 replies; 25+ messages in thread From: Jan Kara @ 2008-12-18 16:38 UTC (permalink / raw) To: Theodore Tso; +Cc: Toshiyuki Okajima, Ext4 Developers List, linux-fsdevel On Thu 18-12-08 09:54:00, Theodore Tso wrote: > On Thu, Dec 18, 2008 at 02:12:34PM +0100, Jan Kara wrote: > > This is the thing I was wondering about. Why exactly is the spinlock > > necessary for blkdev_releasepage()? I understand we have to protect > > reading client_releasepage() pointer because it could change but my point > > was that it changes only during mount / umount. > > Hmm.... I suppose we could use RCU, but then we'd have to worry about > the race condition where client_releasepage() gets called after the > umount has happened. Yes. Actually, I'm not so much against spinlock for obtaining the function pointer but you needn't hold it when you actually call the function. Except that you have to care about those umount races, I agree. > > > I also think we are sad that we cannot implement various > > > implementations for client_releasepage(). But now I cannot imagine > > > what to do for a client_releasepage() which can sleep, too... > > My suggestion is that we not worry about making changes to > fs/block_dev.c to allow client_releasepage() to sleep until we have > filesystems that really need client_releasepage() to sleep. It > probably is possible, with appropriate atomic bit sets for flags to > indicate an unmount in progress, and client_releasepage in progress, > and use of RCU, we could allow client_releasepage. But it might not > be worth it unless there is a filesystem that really needs it. OK. Actually, there is a possibility when ext3/4 might want wait - if someone does direct IO on the block device, it might fail with EIO because the page is pinned by JBD and we have to wait for it. But OTOH doing dio on a block device with a mounted filesystem is *really* a stupid thing to do so I agree we don't care. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems 2008-12-18 13:12 ` Jan Kara 2008-12-18 14:54 ` Theodore Tso @ 2008-12-19 5:15 ` Toshiyuki Okajima 1 sibling, 0 replies; 25+ messages in thread From: Toshiyuki Okajima @ 2008-12-19 5:15 UTC (permalink / raw) To: Jan Kara; +Cc: Theodore Ts'o, Ext4 Developers List, linux-fsdevel Hello. Jan Kara wrote: > Hello, > <SNIP> > > Which of the following do you mean: > > 1) If using a spinlock in client_releasepage() is only for mount/umount, > > this implementation is not wise. > > 2) There is the fact that a spinlock is necessary for blkdev_releasepage(). > > This fact prevents us from making various implementations of > > client_releasepage(). > > (Without a spinlock, we can implement a client_releasepage() which can release > > the buffers with a sleep. As a result, it may enable more buffers release than > > before.) > > > > There is the fact that a filesystem can be mounted on several places, > > and the lock mechanism is absolutely necessary for this fact. > This is the thing I was wondering about. Why exactly is the spinlock > necessary for blkdev_releasepage()? I understand we have to protect > reading client_releasepage() pointer because it could change but my point > was that it changes only during mount / umount. There are 2 purposes of this lock. 1) The race between filesystem's mount and umount. (So that a filesystem can be mounted on several places concurrently.) ------------------------------------------------------------------ Without this lock, there is a possibility that the pointer of ei->client_releasepage becomes NULL by umount. As a result, a special releasepage for its filesystem is not used even if its filesystem has been mounted. ------------------------------------------------------------------ 2) The race between the usage of blkdev_releasepage() and umount. ------------------------------------------------------------------ Without this lock, there is a possibility that the pointer of ei->client_releasepage becomes NULL by umount. As a result, the process which calls blkdev_releasepage() may experience a page fault. Because blkdev_releasepage() refers the value ei->client_releasepage and then calls it as a function. But even if the pointer is not NULL, there is a possibility that a filesystem which has it has been unmounted. Besides, there is a possibility that the module of the filesystem has been unloaded. In this case, something wrong can happen. (Example: While a filesystem is being unmounted, one of its resources can be touched by using the ei->client_releasepage of the filesystem by the side of calling blkdev_releasepage.) ------------------------------------------------------------------ Therefore some lock mechanisms are necessary to solve the races. Regards, Toshiyuki Okajima ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems 2008-12-12 17:52 ` [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems Theodore Ts'o 2008-12-12 17:52 ` [PATCH -v3] ext3: provide function to release metadata pages under memory pressure Theodore Ts'o 2008-12-17 15:39 ` [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems Jan Kara @ 2008-12-26 5:01 ` Al Viro 2009-01-03 15:09 ` Theodore Ts'o 2 siblings, 1 reply; 25+ messages in thread From: Al Viro @ 2008-12-26 5:01 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Ext4 Developers List, Toshiyuki Okajima, linux-fsdevel On Fri, Dec 12, 2008 at 12:52:53PM -0500, Theodore Ts'o wrote: > From: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> > > Implement blkdev_releasepage() to release the buffer_heads and page > after we release private data which belongs to a client of the block > device, such as a filesystem. > > blkdev_releasepage() call the client's releasepage() which is > registered by blkdev_register_client_releasepage() to release its > private data. > > Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> > Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> > Cc: linux-fsdevel@vger.kernel.org Entire-pile-NAKed-by: Al Viro <viro@zeniv.linux.org.uk> a) Use of fs_type to hide a callback is wrong. Pass it explicitly to whatever helper you want, don't do that crapin get_sb_bdev() b) the same goes from unregistration c) you are unregistering your callback before doing generic_shutdown_super(). Odd, seeing that a _lot_ of fs operations can happen at that point. d) we *already* have exclusion mechanism. It's called open_bdev_exclusive() and it is used by get_sb_bdev() and friends already. Don't reinvent the wheel, please. e) what's going on with the locking there? Comments about mount/umount races are absolutely bogus - we *have* serialization for fs shutdown/startup. And if we hadn't, we would have far worse problems with races than that. The other kind of race is possible, but... this interface is asking for trouble. It sounds like a way to attach some data structures of your own to page and rely on that callback for freeing them. But as soon as somebody tries that we'll have a problem; page can outlive the unregistration of callback and we'll get a leak (in the best case). Sure, ext3 and ext4 won't step into it (journal shutdown will deal with that), but it's a trap for unaware. At the very least it needs to be commented. Said that, I still don't like the use of rwlock here ;-/ If nothing else, that calls for rcu - fs shutdown is extremely rare compared to releasepage... ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems 2008-12-26 5:01 ` Al Viro @ 2009-01-03 15:09 ` Theodore Ts'o 2009-01-03 15:09 ` [PATCH 1/3] add releasepage " Theodore Ts'o 0 siblings, 1 reply; 25+ messages in thread From: Theodore Ts'o @ 2009-01-03 15:09 UTC (permalink / raw) To: Al Viro; +Cc: Ext4 Developers List Al, Thanks for your comments; they were quite useful. I've reworked Okajima-san's patches to address your concerns. What do you think? - Ted ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 1/3] add releasepage hooks to block devices which can be used by file systems 2009-01-03 15:09 ` Theodore Ts'o @ 2009-01-03 15:09 ` Theodore Ts'o 2009-01-03 15:09 ` [PATCH 2/3] ext3: provide function to release metadata pages under memory pressure Theodore Ts'o 2009-01-05 8:16 ` [PATCH 1/3] add releasepage hooks to block devices which can be used by file systems Toshiyuki Okajima 0 siblings, 2 replies; 25+ messages in thread From: Theodore Ts'o @ 2009-01-03 15:09 UTC (permalink / raw) To: Al Viro Cc: Ext4 Developers List, Theodore Ts'o, Toshiyuki Okajima, linux-fsdevel Implement blkdev_releasepage() to release the buffer_heads and pages after we release private data belonging to a mounted filesystem. Cc: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> --- fs/block_dev.c | 15 +++++++++++++++ fs/super.c | 2 ++ include/linux/fs.h | 2 ++ 3 files changed, 19 insertions(+), 0 deletions(-) diff --git a/fs/block_dev.c b/fs/block_dev.c index 99e0ae1..ef7d795 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -1219,6 +1219,20 @@ static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg) return blkdev_ioctl(bdev, mode, cmd, arg); } +/* + * Try to release a page associated with block device when the system + * is under memory pressure. + */ +static int blkdev_releasepage(struct page *page, gfp_t wait) +{ + struct super_block *super = BDEV_I(page->mapping->host)->bdev.bd_super; + + if (super && super->s_op->bdev_try_to_free_page) + return super->s_op->bdev_try_to_free_page(super, page, wait); + + return try_to_free_buffers(page); +} + static const struct address_space_operations def_blk_aops = { .readpage = blkdev_readpage, .writepage = blkdev_writepage, @@ -1226,6 +1240,7 @@ static const struct address_space_operations def_blk_aops = { .write_begin = blkdev_write_begin, .write_end = blkdev_write_end, .writepages = generic_writepages, + .releasepage = blkdev_releasepage, .direct_IO = blkdev_direct_IO, }; diff --git a/fs/super.c b/fs/super.c index 400a760..d7e200d 100644 --- a/fs/super.c +++ b/fs/super.c @@ -800,6 +800,7 @@ int get_sb_bdev(struct file_system_type *fs_type, } s->s_flags |= MS_ACTIVE; + bdev->bd_super = s; } return simple_set_mnt(mnt, s); @@ -819,6 +820,7 @@ void kill_block_super(struct super_block *sb) struct block_device *bdev = sb->s_bdev; fmode_t mode = sb->s_mode; + bdev->bd_super = 0; generic_shutdown_super(sb); sync_blockdev(bdev); close_bdev_exclusive(bdev, mode); diff --git a/include/linux/fs.h b/include/linux/fs.h index 4a853ef..911f812 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -553,6 +553,7 @@ struct address_space { struct block_device { dev_t bd_dev; /* not a kdev_t - it's a search key */ struct inode * bd_inode; /* will die */ + struct super_block * bd_super; int bd_openers; struct mutex bd_mutex; /* open/close mutex */ struct semaphore bd_mount_sem; @@ -1375,6 +1376,7 @@ struct super_operations { ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t); ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t); #endif + int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t); }; /* -- 1.6.0.4.8.g36f27.dirty ^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 2/3] ext3: provide function to release metadata pages under memory pressure 2009-01-03 15:09 ` [PATCH 1/3] add releasepage " Theodore Ts'o @ 2009-01-03 15:09 ` Theodore Ts'o 2009-01-03 15:09 ` [PATCH 3/3] ext4: " Theodore Ts'o 2009-01-05 8:16 ` [PATCH 1/3] add releasepage hooks to block devices which can be used by file systems Toshiyuki Okajima 1 sibling, 1 reply; 25+ messages in thread From: Theodore Ts'o @ 2009-01-03 15:09 UTC (permalink / raw) To: Al Viro Cc: Ext4 Developers List, Toshiyuki Okajima, Theodore Ts'o, linux-fsdevel From: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Pages in the page cache belonging to ext3 data files are released via the ext3_releasepage() function specified in the ext3 inode's address_space_ops. However, metadata blocks (such as indirect blocks, directory blocks, etc) are managed via the block device address_space_ops, and they can not be released by try_to_free_buffers() if they have a journal head attached to them. To address this, we supply a try_to_free_pages() function which calls journal_try_to_free_buffers() function to free the metadata, and which is called by the block device's blkdev_releasepage() function. Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: linux-fsdevel@vger.kernel.org --- fs/ext3/super.c | 20 ++++++++++++++++++++ 1 files changed, 20 insertions(+), 0 deletions(-) diff --git a/fs/ext3/super.c b/fs/ext3/super.c index 541d5e4..93655a6 100644 --- a/fs/ext3/super.c +++ b/fs/ext3/super.c @@ -682,6 +682,25 @@ static struct dentry *ext3_fh_to_parent(struct super_block *sb, struct fid *fid, ext3_nfs_get_inode); } +/* + * Try to release metadata pages (indirect blocks, directories) which are + * mapped via the block device. Since these pages could have journal heads + * which would prevent try_to_free_buffers() from freeing them, we must use + * jbd layer's try_to_free_buffers() function to release them. + */ +static int bdev_try_to_free_page(struct super_block *sb, struct page *page, + gfp_t wait) +{ + journal_t *journal = EXT3_SB(sb)->s_journal; + + WARN_ON(PageChecked(page)); + if (!page_has_buffers(page)) + return 0; + if (journal) + return journal_try_to_free_buffers(journal, page, wait); + return try_to_free_buffers(page); +} + #ifdef CONFIG_QUOTA #define QTYPE2NAME(t) ((t)==USRQUOTA?"user":"group") #define QTYPE2MOPT(on, t) ((t)==USRQUOTA?((on)##USRJQUOTA):((on)##GRPJQUOTA)) @@ -746,6 +765,7 @@ static const struct super_operations ext3_sops = { .quota_read = ext3_quota_read, .quota_write = ext3_quota_write, #endif + .bdev_try_to_free_page = bdev_try_to_free_page, }; static const struct export_operations ext3_export_ops = { -- 1.6.0.4.8.g36f27.dirty ^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 3/3] ext4: provide function to release metadata pages under memory pressure 2009-01-03 15:09 ` [PATCH 2/3] ext3: provide function to release metadata pages under memory pressure Theodore Ts'o @ 2009-01-03 15:09 ` Theodore Ts'o 0 siblings, 0 replies; 25+ messages in thread From: Theodore Ts'o @ 2009-01-03 15:09 UTC (permalink / raw) To: Al Viro Cc: Ext4 Developers List, Toshiyuki Okajima, Theodore Ts'o, linux-fsdevel From: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Pages in the page cache belonging to ext4 data files are released via the ext4_releasepage() function specified in the ext4 inode's address_space_ops. However, metadata blocks (such as indirect blocks, directory blocks, etc) are managed via the block device address_space_ops, and they can not be released by try_to_free_buffers() if they have a journal head attached to them. To address this, we supply a release_metadata function which calls jbd2_journal_try_to_free_buffers() function to free the metadata, and which is called by the block device's blkdev_releasepage() function. Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: linux-fsdevel@vger.kernel.org --- fs/ext4/super.c | 19 +++++++++++++++++++ 1 files changed, 19 insertions(+), 0 deletions(-) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 0e0653d..980391d 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -907,6 +907,24 @@ static struct dentry *ext4_fh_to_parent(struct super_block *sb, struct fid *fid, ext4_nfs_get_inode); } +/* + * Try to release metadata pages (indirect blocks, directories) which are + * mapped via the block device. Since these pages could have journal heads + * which would prevent try_to_free_buffers() from freeing them, we must use + * jbd2 layer's try_to_free_buffers() function to release them. + */ +static int bdev_try_to_free_page(struct super_block *sb, struct page *page, gfp_t wait) +{ + journal_t *journal = EXT4_SB(sb)->s_journal; + + WARN_ON(PageChecked(page)); + if (!page_has_buffers(page)) + return 0; + if (journal) + return jbd2_journal_try_to_free_buffers(journal, page, wait); + return try_to_free_buffers(page); +} + #ifdef CONFIG_QUOTA #define QTYPE2NAME(t) ((t) == USRQUOTA ? "user" : "group") #define QTYPE2MOPT(on, t) ((t) == USRQUOTA?((on)##USRJQUOTA):((on)##GRPJQUOTA)) @@ -971,6 +989,7 @@ static const struct super_operations ext4_sops = { .quota_read = ext4_quota_read, .quota_write = ext4_quota_write, #endif + .bdev_try_to_free_page = bdev_try_to_free_page, }; static const struct export_operations ext4_export_ops = { -- 1.6.0.4.8.g36f27.dirty ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH 1/3] add releasepage hooks to block devices which can be used by file systems 2009-01-03 15:09 ` [PATCH 1/3] add releasepage " Theodore Ts'o 2009-01-03 15:09 ` [PATCH 2/3] ext3: provide function to release metadata pages under memory pressure Theodore Ts'o @ 2009-01-05 8:16 ` Toshiyuki Okajima 2009-01-05 16:05 ` Theodore Tso 1 sibling, 1 reply; 25+ messages in thread From: Toshiyuki Okajima @ 2009-01-05 8:16 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Al Viro, Ext4 Developers List, linux-fsdevel Hi Ted-san, Theodore Ts'o wrote: > Implement blkdev_releasepage() to release the buffer_heads and pages > after we release private data belonging to a mounted filesystem. > > Cc: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> > Cc: linux-fsdevel@vger.kernel.org > Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> > --- > fs/block_dev.c | 15 +++++++++++++++ > fs/super.c | 2 ++ > include/linux/fs.h | 2 ++ > 3 files changed, 19 insertions(+), 0 deletions(-) I was confirming whether the kernel to which your new patch is applied can run without trouble. But unfortunately, I got a hangup problem. Now I am investigating the root cause. After I investigated it for a little time, I think calling log_wait_commit() from journal_try_to_free_buffers() can cause it. I examine it a little more in detail. Additional Info(Crash dump): Backtrace: crash> bt PID: 260 TASK: f71076d0 CPU: 1 COMMAND: "kswapd0" #0 [f707dcbc] schedule at c06346a3 #1 [f707dd34] log_wait_commit at f80904c1 #2 [f707dd70] journal_try_to_free_buffers at f808c81f #3 [f707dd94] blkdev_releasepage at c04916cc #4 [f707dda4] try_to_release_page at c04526b1 #5 [f707ddb0] shrink_page_list at c045b3d1 #6 [f707de50] shrink_list at c045b72e #7 [f707def0] shrink_zone at c045bbc6 #8 [f707df40] kswapd at c045c12c #9 [f707dfd8] kthread at c043612c #10 [f707dfe4] kernel_thread_helper at c04045e1 Sleep time: crash> ps -l | head -n 1 [5360808577593] PID: 7995 TASK: c98b76d0 CPU: 1 COMMAND: "crtfile" crash> ps -l 260 [3727586943566] PID: 260 TASK: f71076d0 CPU: 1 COMMAND: "kswapd0" crash> p (5360808577593 - 3727586943566)/1000000000 $4 = 1633 ======> 1633 seconds Best Regards, Toshiyuki Okajima ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/3] add releasepage hooks to block devices which can be used by file systems 2009-01-05 8:16 ` [PATCH 1/3] add releasepage hooks to block devices which can be used by file systems Toshiyuki Okajima @ 2009-01-05 16:05 ` Theodore Tso 2009-01-06 4:07 ` Toshiyuki Okajima 0 siblings, 1 reply; 25+ messages in thread From: Theodore Tso @ 2009-01-05 16:05 UTC (permalink / raw) To: Toshiyuki Okajima; +Cc: Al Viro, Ext4 Developers List, linux-fsdevel On Mon, Jan 05, 2009 at 05:16:08PM +0900, Toshiyuki Okajima wrote: > > I was confirming whether the kernel to which your new patch is > applied can run without trouble. But unfortunately, I got a hangup > problem. Now I am investigating the root cause. After I > investigated it for a little time, I think calling log_wait_commit() > from journal_try_to_free_buffers() can cause it. Sounds like a deadlock caused by the fact that we're no longer masking __GFP_WAIT, probably on journal->j_wait_done_commit. Presumably the system came under pressure during a commit operation, which makes sense, and so we ended up with a deadlock between kjournald and kswapd. The fix is pretty simple; we just need to mask out the __GFP_WAIT in the filesystem-specific callback, since this is a restriction imposed by the filesystem's use of the jbd/jbd2 layer. - Ted ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/3] add releasepage hooks to block devices which can be used by file systems 2009-01-05 16:05 ` Theodore Tso @ 2009-01-06 4:07 ` Toshiyuki Okajima 2009-01-06 4:29 ` Theodore Tso 0 siblings, 1 reply; 25+ messages in thread From: Toshiyuki Okajima @ 2009-01-06 4:07 UTC (permalink / raw) To: Theodore Tso; +Cc: Al Viro, Ext4 Developers List, linux-fsdevel Ted-san, Theodore Tso wrote: > On Mon, Jan 05, 2009 at 05:16:08PM +0900, Toshiyuki Okajima wrote: > > > > > > I was confirming whether the kernel to which your new patch is > > > applied can run without trouble. But unfortunately, I got a hangup > > > problem. Now I am investigating the root cause. After I > > > investigated it for a little time, I think calling log_wait_commit() > > > from journal_try_to_free_buffers() can cause it. > > Sounds like a deadlock caused by the fact that we're no longer masking > __GFP_WAIT, probably on journal->j_wait_done_commit. Presumably the > system came under pressure during a commit operation, which makes > sense, and so we ended up with a deadlock between kjournald and > kswapd. The fix is pretty simple; we just need to mask out the > __GFP_WAIT in the filesystem-specific callback, since this is a > restriction imposed by the filesystem's use of the jbd/jbd2 layer. Your opinion is correct. A detailed investigation is done, and the root cause has been understood. The deadlock was caused by the following two processes: (1) A certain process Memory collecting process which is started by a memory allocator calls journal_try_to_free_buffers(). And then it calls log_wait_commit() to get more memory and waits for the finish of one committing transaction. (2) kjournald process kjournald process starts by Process (1) calling log_wait_commit(). And then it calls journal_commit_transaction to write all data buffers into the filesystem and write all metadata buffers into the journal storage. Writing metadata buffer is journal_write_metadata_buffer(). This function also needs new buffer_head (more memory) in order to copy a buffer_head. Detailed Information: Process (1): crash> bt 260 PID: 260 TASK: f71076d0 CPU: 1 COMMAND: "kswapd0" #0 [f707dcbc] schedule at c06346a3 #1 [f707dd34] log_wait_commit at f80904c1 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -> It lets kjournald start and waits for the commit. #2 [f707dd70] journal_try_to_free_buffers at f808c81f #3 [f707dd94] blkdev_releasepage at c04916cc #4 [f707dda4] try_to_release_page at c04526b1 #5 [f707ddb0] shrink_page_list at c045b3d1 #6 [f707de50] shrink_list at c045b72e #7 [f707def0] shrink_zone at c045bbc6 #8 [f707df40] kswapd at c045c12c #9 [f707dfd8] kthread at c043612c #10 [f707dfe4] kernel_thread_helper at c04045e1 journal structure: 0xccab1e00 Process (2) [kjournald]: PID: 3170 TASK: f717b240 CPU: 1 COMMAND: "kjournald" #0 [c42b4cf4] schedule at c06346a3 #1 [c42b4d6c] schedule_timeout at c06349ef #2 [c42b4d90] io_schedule_timeout at c0633e0f #3 [c42b4da0] congestion_wait at c045d7ee #4 [c42b4dc8] try_to_free_pages at c045c82a #5 [c42b4e2c] __alloc_pages_internal at c04579fc #6 [c42b4e70] cache_alloc_refill at c0471235 #7 [c42b4ec0] kmem_cache_alloc at c0470fa8 #8 [c42b4ed4] alloc_buffer_head at c048c06b ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^-> It tries to get a buffer but cannot get one. Because memory collectors (include: process (1)) cannot go farther. #9 [c42b4edc] journal_write_metadata_buffer at f8090eb6 #10 [c42b4f10] journal_commit_transaction at f808df80 #11 [c42b4f98] kjournald at f809089d #12 [c42b4fd8] kthread at c043612c #13 [c42b4fe4] kernel_thread_helper at c04045e1 journal structure: 0xccab1e00 Additional Information: The process by which the trigger of a deadlock is pulled is not only kswapd. [1] PID: 1800 TASK: f7379b60 CPU: 1 COMMAND: "rsyslogd" #0 [f61c3bfc] schedule at c06346a3 #1 [f61c3c74] log_wait_commit at f80904c1 #2 [f61c3cb0] journal_try_to_free_buffers at f808c81f #3 [f61c3cd4] blkdev_releasepage at c04916cc #4 [f61c3ce4] try_to_release_page at c04526b1 #5 [f61c3cf0] shrink_page_list at c045b3d1 #6 [f61c3d90] shrink_list at c045b72e #7 [f61c3e30] shrink_zone at c045bbc6 #8 [f61c3e80] try_to_free_pages at c045c787 #9 [f61c3ee4] __alloc_pages_internal at c04579fc #10 [f61c3f28] __get_free_pages at c0457bac #11 [f61c3f30] copy_process at c0425823 #12 [f61c3f68] do_fork at c042674b #13 [f61c3fa4] sys_clone at c0402399 #14 [f61c3fb4] system_call at c0403893 EAX: ffffffda EBX: 003d0f00 ECX: b7fcd4b4 EDX: b7fcdbd8 DS: 007b ESI: b6fcb16c ES: 007b EDI: b7fcdbd8 SS: 007b ESP: b6fcb100 EBP: b6fcb198 CS: 0073 EIP: 00d271f8 ERR: 00000078 EFLAGS: 00000296 [2] PID: 1990 TASK: f70c6000 CPU: 0 COMMAND: "pcscd" #0 [f6078be0] schedule at c06346a3 #1 [f6078c58] log_wait_commit at f80904c1 #2 [f6078c94] journal_try_to_free_buffers at f808c81f #3 [f6078cb8] blkdev_releasepage at c04916cc #4 [f6078cc8] try_to_release_page at c04526b1 #5 [f6078cd4] shrink_page_list at c045b3d1 #6 [f6078d74] shrink_list at c045b72e #7 [f6078e14] shrink_zone at c045bbc6 #8 [f6078e64] try_to_free_pages at c045c787 #9 [f6078ec8] __alloc_pages_internal at c04579fc #10 [f6078f0c] cache_alloc_refill at c0471235 #11 [f6078f5c] kmem_cache_alloc at c0470fa8 #12 [f6078f70] getname at c047b71c #13 [f6078f88] do_sys_open at c04729d2 #14 [f6078fa0] sys_open at c0472ab6 #15 [f6078fb4] ia32_sysenter_target at c04037da EAX: 00000005 EBX: 006a2700 ECX: 00098800 EDX: 00000000 DS: 007b ESI: 006a2700 ES: 007b EDI: 00000000 SS: 007b ESP: b801d0f8 EBP: b801d188 CS: 0073 EIP: b803f424 ERR: 00000005 EFLAGS: 00000202 ... Regards, Toshiyuki Okajima ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/3] add releasepage hooks to block devices which can be used by file systems 2009-01-06 4:07 ` Toshiyuki Okajima @ 2009-01-06 4:29 ` Theodore Tso 0 siblings, 0 replies; 25+ messages in thread From: Theodore Tso @ 2009-01-06 4:29 UTC (permalink / raw) To: Toshiyuki Okajima; +Cc: Al Viro, Ext4 Developers List, linux-fsdevel On Tue, Jan 06, 2009 at 01:07:27PM +0900, Toshiyuki Okajima wrote: > > Sounds like a deadlock caused by the fact that we're no longer masking > > __GFP_WAIT, probably on journal->j_wait_done_commit. Presumably the > > system came under pressure during a commit operation, which makes > > sense, and so we ended up with a deadlock between kjournald and > > kswapd. The fix is pretty simple; we just need to mask out the > > __GFP_WAIT in the filesystem-specific callback, since this is a > > restriction imposed by the filesystem's use of the jbd/jbd2 layer. > > Your opinion is correct. > A detailed investigation is done, and the root cause has been understood. Thanks for confirming my supposition. I've checked the following modified patches into the ext4 patch queue, which will be sent to Linus shortly. - Ted ext3: provide function to release metadata pages under memory pressure From: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Pages in the page cache belonging to ext3 data files are released via the ext3_releasepage() function specified in the ext3 inode's address_space_ops. However, metadata blocks (such as indirect blocks, directory blocks, etc) are managed via the block device address_space_ops, and they can not be released by try_to_free_buffers() if they have a journal head attached to them. To address this, we supply a try_to_free_pages() function which calls journal_try_to_free_buffers() function to free the metadata, and which is called by the block device's blkdev_releasepage() function. Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: linux-fsdevel@vger.kernel.org diff --git a/fs/ext3/super.c b/fs/ext3/super.c index 541d5e4..6900ff0 100644 --- a/fs/ext3/super.c +++ b/fs/ext3/super.c @@ -682,6 +682,26 @@ static struct dentry *ext3_fh_to_parent(struct super_block *sb, struct fid *fid, ext3_nfs_get_inode); } +/* + * Try to release metadata pages (indirect blocks, directories) which are + * mapped via the block device. Since these pages could have journal heads + * which would prevent try_to_free_buffers() from freeing them, we must use + * jbd layer's try_to_free_buffers() function to release them. + */ +static int bdev_try_to_free_page(struct super_block *sb, struct page *page, + gfp_t wait) +{ + journal_t *journal = EXT3_SB(sb)->s_journal; + + WARN_ON(PageChecked(page)); + if (!page_has_buffers(page)) + return 0; + if (journal) + return journal_try_to_free_buffers(journal, page, + wait & ~__GFP_WAIT); + return try_to_free_buffers(page); +} + #ifdef CONFIG_QUOTA #define QTYPE2NAME(t) ((t)==USRQUOTA?"user":"group") #define QTYPE2MOPT(on, t) ((t)==USRQUOTA?((on)##USRJQUOTA):((on)##GRPJQUOTA)) @@ -746,6 +766,7 @@ static const struct super_operations ext3_sops = { .quota_read = ext3_quota_read, .quota_write = ext3_quota_write, #endif + .bdev_try_to_free_page = bdev_try_to_free_page, }; static const struct export_operations ext3_export_ops = { ---------------------- ext4: provide function to release metadata pages under memory pressure From: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Pages in the page cache belonging to ext4 data files are released via the ext4_releasepage() function specified in the ext4 inode's address_space_ops. However, metadata blocks (such as indirect blocks, directory blocks, etc) are managed via the block device address_space_ops, and they can not be released by try_to_free_buffers() if they have a journal head attached to them. To address this, we supply a release_metadata function which calls jbd2_journal_try_to_free_buffers() function to free the metadata, and which is called by the block device's blkdev_releasepage() function. Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: linux-fsdevel@vger.kernel.org diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 0023ca9..5cfac23 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -908,6 +908,25 @@ static struct dentry *ext4_fh_to_parent(struct super_block *sb, struct fid *fid, ext4_nfs_get_inode); } +/* + * Try to release metadata pages (indirect blocks, directories) which are + * mapped via the block device. Since these pages could have journal heads + * which would prevent try_to_free_buffers() from freeing them, we must use + * jbd2 layer's try_to_free_buffers() function to release them. + */ +static int bdev_try_to_free_page(struct super_block *sb, struct page *page, gfp_t wait) +{ + journal_t *journal = EXT4_SB(sb)->s_journal; + + WARN_ON(PageChecked(page)); + if (!page_has_buffers(page)) + return 0; + if (journal) + return jbd2_journal_try_to_free_buffers(journal, page, + wait & ~__GFP_WAIT); + return try_to_free_buffers(page); +} + #ifdef CONFIG_QUOTA #define QTYPE2NAME(t) ((t) == USRQUOTA ? "user" : "group") #define QTYPE2MOPT(on, t) ((t) == USRQUOTA?((on)##USRJQUOTA):((on)##GRPJQUOTA)) @@ -972,6 +991,7 @@ static const struct super_operations ext4_sops = { .quota_read = ext4_quota_read, .quota_write = ext4_quota_write, #endif + .bdev_try_to_free_page = bdev_try_to_free_page, }; static const struct export_operations ext4_export_ops = { ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [BUG][PATCH 1/4] ext3: fix a cause of __schedule_bug via blkdev_releasepage 2008-12-12 6:21 ` Theodore Tso 2008-12-12 17:52 ` [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems Theodore Ts'o @ 2008-12-15 2:21 ` Toshiyuki Okajima 1 sibling, 0 replies; 25+ messages in thread From: Toshiyuki Okajima @ 2008-12-15 2:21 UTC (permalink / raw) To: Theodore Tso; +Cc: aneesh.kumar, balbir, linux-ext4 Ted-san, Theodore Tso wrote: > On Fri, Dec 12, 2008 at 09:54:18AM +0900, Toshiyuki Okajima wrote: > > > To tell the truth, at first, I imagined the same patch as yours to fix this > > > problem. But I have made another patch because I thought that ext3(or ext4) > > > should not know the contents of the processing of journal_try_to_free_buffers > > > in detail. (ext3 should not know there is a possibility to call > > > journal_wait_for_transaction_sync_data from journal_try_to_free_buffers.) > > I agree, but ext3 doesn't need to know that. What my change did was > to mask off the _GFP_WAIT flag, which prohibits the function it calls > from blocking, because it knows that its caller is holding a spinlock. > > And actually, come to think of it. We can do even better; the right > fix is to have blkdev_releasepage() mask off the _GFP_WAIT flag; this > is the function which is taking spinlock, and by masking off the > __GFP_WAIT flag, this is simply requesting all of the downstream > functions not to block, but to do the best job they can do without > blocking. It doesn't need to know whether it's going to call > log_wait_commit(), or anything else; all it needs to do is request > "please don't block". > That means we only make the request once, in the function which is > taking spinlock, so all of the per-filesystem implementations of > release_metadata() don't need to know that its caller is holding a > spinlock. OK. I agree your last change. I also think blkdev_releasepage must do something so that downstream functions of it don't sleep. Masking off the _GFP_WAIT flag is the easiest achievement of it. Besides, I think it is not valid implementation that brings us a care about ei->client_releasepage's sleeping. Additional Information: I did an easy test with your last change and then I haven't experienced any errors. Regards, Toshiyuki Okajima ^ permalink raw reply [flat|nested] 25+ messages in thread
end of thread, other threads:[~2009-01-06 4:29 UTC | newest] Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2008-12-02 11:06 [BUG][PATCH 1/4] ext3: fix a cause of __schedule_bug via blkdev_releasepage Toshiyuki Okajima 2008-12-08 14:01 ` Theodore Tso 2008-12-08 14:06 ` [PATCH -V2] ext3: provide function to release metadata pages under memory pressure Theodore Ts'o 2008-12-08 14:06 ` [PATCH -V2] ext4: " Theodore Ts'o 2008-12-12 0:54 ` [BUG][PATCH 1/4] ext3: fix a cause of __schedule_bug via blkdev_releasepage Toshiyuki Okajima 2008-12-12 6:21 ` Theodore Tso 2008-12-12 17:52 ` [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems Theodore Ts'o 2008-12-12 17:52 ` [PATCH -v3] ext3: provide function to release metadata pages under memory pressure Theodore Ts'o 2008-12-12 17:52 ` [PATCH -v3] ext4: " Theodore Ts'o 2008-12-17 15:39 ` [PATCH -v3] vfs: add releasepages hooks to block devices which can be used by file systems Jan Kara 2008-12-18 5:15 ` Toshiyuki Okajima 2008-12-18 13:12 ` Jan Kara 2008-12-18 14:54 ` Theodore Tso 2008-12-18 16:38 ` Jan Kara 2008-12-19 5:15 ` Toshiyuki Okajima 2008-12-26 5:01 ` Al Viro 2009-01-03 15:09 ` Theodore Ts'o 2009-01-03 15:09 ` [PATCH 1/3] add releasepage " Theodore Ts'o 2009-01-03 15:09 ` [PATCH 2/3] ext3: provide function to release metadata pages under memory pressure Theodore Ts'o 2009-01-03 15:09 ` [PATCH 3/3] ext4: " Theodore Ts'o 2009-01-05 8:16 ` [PATCH 1/3] add releasepage hooks to block devices which can be used by file systems Toshiyuki Okajima 2009-01-05 16:05 ` Theodore Tso 2009-01-06 4:07 ` Toshiyuki Okajima 2009-01-06 4:29 ` Theodore Tso 2008-12-15 2:21 ` [BUG][PATCH 1/4] ext3: fix a cause of __schedule_bug via blkdev_releasepage Toshiyuki Okajima
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).