linux-kernel.vger.kernel.org archive mirror
* [PATCH RFC 0/5] Eliminate most lock_super() calls from ext4
@ 2009-04-26  3:49 Theodore Ts'o
  2009-04-26  3:49 ` [PATCH 1/5] ext4: Avoid races caused by on-line resizing and SMP memory reordering Theodore Ts'o
  0 siblings, 1 reply; 15+ messages in thread
From: Theodore Ts'o @ 2009-04-26  3:49 UTC (permalink / raw)
  To: Ext4 Developers List
  Cc: Linux Kernel Developers List, Christoph Hellwig, Al Viro


On Fri, Apr 24, 2009 at 08:40:47PM +0200, Christoph Hellwig wrote:
> ext3/4 internal bits?  Doesn't seem to be used for any journal related
> activity but mostly as protection against resizing (the whole lock_super
> usage in ext3/4 looks odd to me, interestingly there's none at all in
> ext2.  Maybe someone of the extN crowd should audit and get rid of it in
> favour of a better fs-specific lock)

Here are some patches which eliminate most of the lock_super() and
unlock_super() calls from ext4.  The last remaining calls are designed
to protect against another CPU calling write_super(), which is the
original intended use.

If these patches work out, we can backport these patches to ext3.

Comments and review appreciated; thanks!!

						- Ted


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/5] ext4: Avoid races caused by on-line resizing and SMP memory reordering
  2009-04-26  3:49 [PATCH RFC 0/5] Eliminate most lock_super() calls from ext4 Theodore Ts'o
@ 2009-04-26  3:49 ` Theodore Ts'o
  2009-04-26  3:49   ` [PATCH 2/5] ext4: Remove outdated comment about lock_super() Theodore Ts'o
  2009-04-28 16:23   ` [PATCH 1/5] ext4: Avoid races caused by on-line resizing and SMP memory reordering Jan Kara
  0 siblings, 2 replies; 15+ messages in thread
From: Theodore Ts'o @ 2009-04-26  3:49 UTC (permalink / raw)
  To: Ext4 Developers List
  Cc: Linux Kernel Developers List, Christoph Hellwig, Al Viro,
	Theodore Ts'o

Ext4's on-line resizing adds a new block group and only adjusts
s_groups_count as the last step.  However, on SMP systems it's
possible that another CPU could see the updated s_groups_count
without seeing the newly initialized data structures for the
just-added block group.  For this reason, it's important to insert an
SMP read barrier after reading s_groups_count and before reading any
dependent data, such as the block group descriptors made valid by the
new group count.

Unfortunately, we rather blatantly violate this locking protocol as
documented in fs/ext4/resize.c.  Fortunately, (1) on-line resizes
happen relatively rarely, and (2) it is rare for the filesystem code
to use a just-added block group before the memory ordering issues
resolve themselves.  So problems here are apparently hard to hit;
ext3 is vulnerable to the same race, and no one seems to have
complained.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 fs/ext4/balloc.c  |    6 ++++--
 fs/ext4/ialloc.c  |   34 +++++++++++++++++++++-------------
 fs/ext4/mballoc.c |   49 ++++++++++++++++++++++++++++++++-----------------
 3 files changed, 57 insertions(+), 32 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 53c72ad..d1615f2 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -88,9 +88,11 @@ unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
 		 ext4_group_t block_group, struct ext4_group_desc *gdp)
 {
 	int bit, bit_max;
+	ext4_group_t ngroups = EXT4_SB(sb)->s_groups_count;
 	unsigned free_blocks, group_blocks;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 
+	smp_rmb();		/* after reading s_groups_count first */
 	if (bh) {
 		J_ASSERT_BH(bh, buffer_locked(bh));
 
@@ -123,7 +125,7 @@ unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
 		bit_max += ext4_bg_num_gdb(sb, block_group);
 	}
 
-	if (block_group == sbi->s_groups_count - 1) {
+	if (block_group == ngroups - 1) {
 		/*
 		 * Even though mke2fs always initialize first and last group
 		 * if some other tool enabled the EXT4_BG_BLOCK_UNINIT we need
@@ -131,7 +133,7 @@ unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
 		 */
 		group_blocks = ext4_blocks_count(sbi->s_es) -
 			le32_to_cpu(sbi->s_es->s_first_data_block) -
-			(EXT4_BLOCKS_PER_GROUP(sb) * (sbi->s_groups_count - 1));
+			(EXT4_BLOCKS_PER_GROUP(sb) * (ngroups - 1));
 	} else {
 		group_blocks = EXT4_BLOCKS_PER_GROUP(sb);
 	}
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index f18e0a0..52ce274 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -322,6 +322,7 @@ static int find_group_dir(struct super_block *sb, struct inode *parent,
 	ext4_group_t group;
 	int ret = -1;
 
+	smp_rmb();		/* after reading s_groups_count first */
 	freei = percpu_counter_read_positive(&EXT4_SB(sb)->s_freeinodes_counter);
 	avefreei = freei / ngroups;
 
@@ -362,7 +363,8 @@ static int find_group_flex(struct super_block *sb, struct inode *parent,
 	ext4_group_t n_fbg_groups;
 	ext4_group_t i;
 
-	n_fbg_groups = (sbi->s_groups_count + flex_size - 1) >>
+	smp_rmb();		/* after reading s_groups_count first */
+	n_fbg_groups = (ngroups + flex_size - 1) >>
 		sbi->s_log_groups_per_flex;
 
 find_close_to_parent:
@@ -478,18 +480,18 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent,
 {
 	ext4_group_t parent_group = EXT4_I(parent)->i_block_group;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
-	ext4_group_t ngroups = sbi->s_groups_count;
 	int inodes_per_group = EXT4_INODES_PER_GROUP(sb);
 	unsigned int freei, avefreei;
 	ext4_fsblk_t freeb, avefreeb;
 	unsigned int ndirs;
 	int max_dirs, min_inodes;
 	ext4_grpblk_t min_blocks;
-	ext4_group_t i, grp, g;
+	ext4_group_t i, grp, g, ngroups = sbi->s_groups_count;
 	struct ext4_group_desc *desc;
 	struct orlov_stats stats;
 	int flex_size = ext4_flex_bg_size(sbi);
 
+	smp_rmb();		/* after reading s_groups_count first */
 	if (flex_size > 1) {
 		ngroups = (ngroups + flex_size - 1) >>
 			sbi->s_log_groups_per_flex;
@@ -585,6 +587,7 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent,
 fallback:
 	ngroups = sbi->s_groups_count;
 	avefreei = freei / ngroups;
+	smp_rmb();
 fallback_retry:
 	parent_group = EXT4_I(parent)->i_block_group;
 	for (i = 0; i < ngroups; i++) {
@@ -613,11 +616,11 @@ static int find_group_other(struct super_block *sb, struct inode *parent,
 			    ext4_group_t *group, int mode)
 {
 	ext4_group_t parent_group = EXT4_I(parent)->i_block_group;
-	ext4_group_t ngroups = EXT4_SB(sb)->s_groups_count;
+	ext4_group_t i, last, ngroups = EXT4_SB(sb)->s_groups_count;
 	struct ext4_group_desc *desc;
-	ext4_group_t i, last;
 	int flex_size = ext4_flex_bg_size(EXT4_SB(sb));
 
+	smp_rmb();		/* after reading s_groups_count first */
 	/*
 	 * Try to place the inode is the same flex group as its
 	 * parent.  If we can't find space, use the Orlov algorithm to
@@ -799,7 +802,7 @@ struct inode *ext4_new_inode(handle_t *handle, struct inode *dir, int mode)
 	struct super_block *sb;
 	struct buffer_head *inode_bitmap_bh = NULL;
 	struct buffer_head *group_desc_bh;
-	ext4_group_t group = 0;
+	ext4_group_t ngroups, group = 0;
 	unsigned long ino = 0;
 	struct inode *inode;
 	struct ext4_group_desc *gdp = NULL;
@@ -851,12 +854,14 @@ struct inode *ext4_new_inode(handle_t *handle, struct inode *dir, int mode)
 		ret2 = find_group_other(sb, dir, &group, mode);
 
 got_group:
+	ngroups = sbi->s_groups_count;
+	smp_rmb();
 	EXT4_I(dir)->i_last_alloc_group = group;
 	err = -ENOSPC;
 	if (ret2 == -1)
 		goto out;
 
-	for (i = 0; i < sbi->s_groups_count; i++) {
+	for (i = 0; i < ngroups; i++) {
 		err = -EIO;
 
 		gdp = ext4_get_group_desc(sb, group, &group_desc_bh);
@@ -917,7 +922,7 @@ repeat_in_this_group:
 		 * group descriptor metadata has not yet been updated.
 		 * So we just go onto the next blockgroup.
 		 */
-		if (++group == sbi->s_groups_count)
+		if (++group == ngroups)
 			group = 0;
 	}
 	err = -ENOSPC;
@@ -1158,17 +1163,18 @@ unsigned long ext4_count_free_inodes(struct super_block *sb)
 {
 	unsigned long desc_count;
 	struct ext4_group_desc *gdp;
-	ext4_group_t i;
+	ext4_group_t i, ngroups = EXT4_SB(sb)->s_groups_count;
 #ifdef EXT4FS_DEBUG
 	struct ext4_super_block *es;
 	unsigned long bitmap_count, x;
 	struct buffer_head *bitmap_bh = NULL;
 
+	smp_rmb();		/* after reading s_groups_count first */
 	es = EXT4_SB(sb)->s_es;
 	desc_count = 0;
 	bitmap_count = 0;
 	gdp = NULL;
-	for (i = 0; i < EXT4_SB(sb)->s_groups_count; i++) {
+	for (i = 0; i < ngroups; i++) {
 		gdp = ext4_get_group_desc(sb, i, NULL);
 		if (!gdp)
 			continue;
@@ -1189,8 +1195,9 @@ unsigned long ext4_count_free_inodes(struct super_block *sb)
 	       le32_to_cpu(es->s_free_inodes_count), desc_count, bitmap_count);
 	return desc_count;
 #else
+	smp_rmb();		/* after reading s_groups_count first */
 	desc_count = 0;
-	for (i = 0; i < EXT4_SB(sb)->s_groups_count; i++) {
+	for (i = 0; i < ngroups; i++) {
 		gdp = ext4_get_group_desc(sb, i, NULL);
 		if (!gdp)
 			continue;
@@ -1205,9 +1212,10 @@ unsigned long ext4_count_free_inodes(struct super_block *sb)
 unsigned long ext4_count_dirs(struct super_block * sb)
 {
 	unsigned long count = 0;
-	ext4_group_t i;
+	ext4_group_t i, ngroups = EXT4_SB(sb)->s_groups_count;
 
-	for (i = 0; i < EXT4_SB(sb)->s_groups_count; i++) {
+	smp_rmb();		/* after reading s_groups_count first */
+	for (i = 0; i < ngroups; i++) {
 		struct ext4_group_desc *gdp = ext4_get_group_desc(sb, i, NULL);
 		if (!gdp)
 			continue;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index f871677..ecc2d49 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -739,6 +739,7 @@ static void ext4_mb_generate_buddy(struct super_block *sb,
 
 static int ext4_mb_init_cache(struct page *page, char *incore)
 {
+	ext4_group_t ngroups;
 	int blocksize;
 	int blocks_per_page;
 	int groups_per_page;
@@ -757,6 +758,8 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
 
 	inode = page->mapping->host;
 	sb = inode->i_sb;
+	ngroups = EXT4_SB(sb)->s_groups_count;
+	smp_rmb();
 	blocksize = 1 << inode->i_blkbits;
 	blocks_per_page = PAGE_CACHE_SIZE / blocksize;
 
@@ -780,7 +783,7 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
 	for (i = 0; i < groups_per_page; i++) {
 		struct ext4_group_desc *desc;
 
-		if (first_group + i >= EXT4_SB(sb)->s_groups_count)
+		if (first_group + i >= ngroups)
 			break;
 
 		err = -EIO;
@@ -852,7 +855,7 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
 		struct ext4_group_info *grinfo;
 
 		group = (first_block + i) >> 1;
-		if (group >= EXT4_SB(sb)->s_groups_count)
+		if (group >= ngroups)
 			break;
 
 		/*
@@ -1788,9 +1791,11 @@ int ext4_mb_get_buddy_cache_lock(struct super_block *sb, ext4_group_t group)
 	int block, pnum;
 	int blocks_per_page;
 	int groups_per_page;
+	ext4_group_t ngroups = EXT4_SB(sb)->s_groups_count;
 	ext4_group_t first_group;
 	struct ext4_group_info *grp;
 
+	smp_rmb();		/* after reading s_groups_count first */
 	blocks_per_page = PAGE_CACHE_SIZE / sb->s_blocksize;
 	/*
 	 * the buddy cache inode stores the block bitmap
@@ -1807,7 +1812,7 @@ int ext4_mb_get_buddy_cache_lock(struct super_block *sb, ext4_group_t group)
 	/* read all groups the page covers into the cache */
 	for (i = 0; i < groups_per_page; i++) {
 
-		if ((first_group + i) >= EXT4_SB(sb)->s_groups_count)
+		if ((first_group + i) >= ngroups)
 			break;
 		grp = ext4_get_group_info(sb, first_group + i);
 		/* take all groups write allocation
@@ -1945,8 +1950,7 @@ err:
 static noinline_for_stack int
 ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 {
-	ext4_group_t group;
-	ext4_group_t i;
+	ext4_group_t ngroups, group, i;
 	int cr;
 	int err = 0;
 	int bsbits;
@@ -1957,6 +1961,8 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 
 	sb = ac->ac_sb;
 	sbi = EXT4_SB(sb);
+	ngroups = EXT4_SB(sb)->s_groups_count;
+	smp_rmb();
 	BUG_ON(ac->ac_status == AC_STATUS_FOUND);
 
 	/* first, try the goal */
@@ -2017,11 +2023,11 @@ repeat:
 		 */
 		group = ac->ac_g_ex.fe_group;
 
-		for (i = 0; i < EXT4_SB(sb)->s_groups_count; group++, i++) {
+		for (i = 0; i < ngroups; group++, i++) {
 			struct ext4_group_info *grp;
 			struct ext4_group_desc *desc;
 
-			if (group == EXT4_SB(sb)->s_groups_count)
+			if (group == ngroups)
 				group = 0;
 
 			/* quick check to skip empty groups */
@@ -2320,7 +2326,7 @@ static void *ext4_mb_seq_groups_start(struct seq_file *seq, loff_t *pos)
 
 	if (*pos < 0 || *pos >= sbi->s_groups_count)
 		return NULL;
-
+	smp_rmb();
 	group = *pos + 1;
 	return (void *) ((unsigned long) group);
 }
@@ -2334,6 +2340,7 @@ static void *ext4_mb_seq_groups_next(struct seq_file *seq, void *v, loff_t *pos)
 	++*pos;
 	if (*pos < 0 || *pos >= sbi->s_groups_count)
 		return NULL;
+	smp_rmb();
 	group = *pos + 1;
 	return (void *) ((unsigned long) group);
 }
@@ -2587,6 +2594,7 @@ void ext4_mb_update_group_info(struct ext4_group_info *grp, ext4_grpblk_t add)
 
 static int ext4_mb_init_backend(struct super_block *sb)
 {
+	ext4_group_t ngroups = EXT4_SB(sb)->s_groups_count;
 	ext4_group_t i;
 	int metalen;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
@@ -2597,8 +2605,10 @@ static int ext4_mb_init_backend(struct super_block *sb)
 	struct ext4_group_info **meta_group_info;
 	struct ext4_group_desc *desc;
 
+	smp_rmb();		/* after reading s_groups_count first */
+
 	/* This is the number of blocks used by GDT */
-	num_meta_group_infos = (sbi->s_groups_count + EXT4_DESC_PER_BLOCK(sb) -
+	num_meta_group_infos = (ngroups + EXT4_DESC_PER_BLOCK(sb) -
 				1) >> EXT4_DESC_PER_BLOCK_BITS(sb);
 
 	/*
@@ -2644,7 +2654,7 @@ static int ext4_mb_init_backend(struct super_block *sb)
 	for (i = 0; i < num_meta_group_infos; i++) {
 		if ((i + 1) == num_meta_group_infos)
 			metalen = sizeof(*meta_group_info) *
-				(sbi->s_groups_count -
+				(ngroups -
 					(i << EXT4_DESC_PER_BLOCK_BITS(sb)));
 		meta_group_info = kmalloc(metalen, GFP_KERNEL);
 		if (meta_group_info == NULL) {
@@ -2655,7 +2665,7 @@ static int ext4_mb_init_backend(struct super_block *sb)
 		sbi->s_group_info[i] = meta_group_info;
 	}
 
-	for (i = 0; i < sbi->s_groups_count; i++) {
+	for (i = 0; i < ngroups; i++) {
 		desc = ext4_get_group_desc(sb, i, NULL);
 		if (desc == NULL) {
 			printk(KERN_ERR
@@ -2781,13 +2791,15 @@ static void ext4_mb_cleanup_pa(struct ext4_group_info *grp)
 
 int ext4_mb_release(struct super_block *sb)
 {
+	ext4_group_t ngroups = EXT4_SB(sb)->s_groups_count;
 	ext4_group_t i;
 	int num_meta_group_infos;
 	struct ext4_group_info *grinfo;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 
+	smp_rmb();		/* after reading s_groups_count first */
 	if (sbi->s_group_info) {
-		for (i = 0; i < sbi->s_groups_count; i++) {
+		for (i = 0; i < ngroups; i++) {
 			grinfo = ext4_get_group_info(sb, i);
 #ifdef DOUBLE_CHECK
 			kfree(grinfo->bb_bitmap);
@@ -2797,7 +2809,7 @@ int ext4_mb_release(struct super_block *sb)
 			ext4_unlock_group(sb, i);
 			kfree(grinfo);
 		}
-		num_meta_group_infos = (sbi->s_groups_count +
+		num_meta_group_infos = (ngroups +
 				EXT4_DESC_PER_BLOCK(sb) - 1) >>
 			EXT4_DESC_PER_BLOCK_BITS(sb);
 		for (i = 0; i < num_meta_group_infos; i++)
@@ -4121,7 +4133,7 @@ static void ext4_mb_return_to_preallocation(struct inode *inode,
 static void ext4_mb_show_ac(struct ext4_allocation_context *ac)
 {
 	struct super_block *sb = ac->ac_sb;
-	ext4_group_t i;
+	ext4_group_t ngroups, i;
 
 	printk(KERN_ERR "EXT4-fs: Can't allocate:"
 			" Allocation context details:\n");
@@ -4145,7 +4157,9 @@ static void ext4_mb_show_ac(struct ext4_allocation_context *ac)
 	printk(KERN_ERR "EXT4-fs: %lu scanned, %d found\n", ac->ac_ex_scanned,
 		ac->ac_found);
 	printk(KERN_ERR "EXT4-fs: groups: \n");
-	for (i = 0; i < EXT4_SB(sb)->s_groups_count; i++) {
+	ngroups = EXT4_SB(sb)->s_groups_count;
+	smp_rmb();
+	for (i = 0; i < ngroups; i++) {
 		struct ext4_group_info *grp = ext4_get_group_info(sb, i);
 		struct ext4_prealloc_space *pa;
 		ext4_grpblk_t start;
@@ -4469,13 +4483,14 @@ static int ext4_mb_release_context(struct ext4_allocation_context *ac)
 
 static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
 {
-	ext4_group_t i;
+	ext4_group_t i, ngroups = EXT4_SB(sb)->s_groups_count;
 	int ret;
 	int freed = 0;
 
+	smp_rmb();		/* after reading s_groups_count first */
 	trace_mark(ext4_mb_discard_preallocations, "dev %s needed %d",
 		   sb->s_id, needed);
-	for (i = 0; i < EXT4_SB(sb)->s_groups_count && needed > 0; i++) {
+	for (i = 0; i < ngroups && needed > 0; i++) {
 		ret = ext4_mb_discard_group_preallocations(sb, i, needed);
 		freed += ret;
 		needed -= ret;
-- 
1.5.6.3



* [PATCH 2/5] ext4: Remove outdated comment about lock_super()
  2009-04-26  3:49 ` [PATCH 1/5] ext4: Avoid races caused by on-line resizing and SMP memory reordering Theodore Ts'o
@ 2009-04-26  3:49   ` Theodore Ts'o
  2009-04-26  3:49     ` [PATCH 3/5] ext4: ext4_mark_recovery_complete() doesn't need to use lock_super Theodore Ts'o
  2009-04-28 16:13     ` [PATCH 2/5] ext4: Remove outdated comment about lock_super() Jan Kara
  2009-04-28 16:23   ` [PATCH 1/5] ext4: Avoid races caused by on-line resizing and SMP memory reordering Jan Kara
  1 sibling, 2 replies; 15+ messages in thread
From: Theodore Ts'o @ 2009-04-26  3:49 UTC (permalink / raw)
  To: Ext4 Developers List
  Cc: Linux Kernel Developers List, Christoph Hellwig, Al Viro,
	Theodore Ts'o

ext4_fill_super() is no longer called by read_super(), and it is no
longer called with the superblock locked.  The unlock_super()/lock_super()
pair is no longer present, so this comment is entirely superfluous.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 fs/ext4/super.c |    8 --------
 1 files changed, 0 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index e20ff9c..e0b0c9f 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2806,14 +2806,6 @@ no_journal:
 		goto failed_mount4;
 	};
 
-	/*
-	 * akpm: core read_super() calls in here with the superblock locked.
-	 * That deadlocks, because orphan cleanup needs to lock the superblock
-	 * in numerous places.  Here we just pop the lock - it's relatively
-	 * harmless, because we are now ready to accept write_super() requests,
-	 * and aviro says that's the only reason for hanging onto the
-	 * superblock lock.
-	 */
 	EXT4_SB(sb)->s_mount_state |= EXT4_ORPHAN_FS;
 	ext4_orphan_cleanup(sb, es);
 	EXT4_SB(sb)->s_mount_state &= ~EXT4_ORPHAN_FS;
-- 
1.5.6.3



* [PATCH 3/5] ext4: ext4_mark_recovery_complete() doesn't need to use lock_super
  2009-04-26  3:49   ` [PATCH 2/5] ext4: Remove outdated comment about lock_super() Theodore Ts'o
@ 2009-04-26  3:49     ` Theodore Ts'o
  2009-04-26  3:49       ` [PATCH 4/5] ext4: Replace lock/unlock_super() with an explicit lock for the orphan list Theodore Ts'o
  2009-04-26  7:07       ` [PATCH 3/5] ext4: ext4_mark_recovery_complete() doesn't need to use lock_super Christoph Hellwig
  2009-04-28 16:13     ` [PATCH 2/5] ext4: Remove outdated comment about lock_super() Jan Kara
  1 sibling, 2 replies; 15+ messages in thread
From: Theodore Ts'o @ 2009-04-26  3:49 UTC (permalink / raw)
  To: Ext4 Developers List
  Cc: Linux Kernel Developers List, Christoph Hellwig, Al Viro,
	Theodore Ts'o

The function ext4_mark_recovery_complete() is called from two call
paths: either (a) while mounting the filesystem, in which case there's
no danger of any other CPU calling write_super() until the mount is
completed, and (b) while remounting the filesystem read-write, in
which case the fs core has already locked the superblock, and in any
case write_super() wouldn't be called until the filesystem is
successfully changed from being mounted read-only to read-write.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 fs/ext4/super.c |   11 +----------
 1 files changed, 1 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index e0b0c9f..176d43f 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3198,14 +3198,12 @@ static void ext4_mark_recovery_complete(struct super_block *sb,
 	if (jbd2_journal_flush(journal) < 0)
 		goto out;
 
-	lock_super(sb);
 	if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER) &&
 	    sb->s_flags & MS_RDONLY) {
 		EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
 		sb->s_dirt = 0;
 		ext4_commit_super(sb, es, 1);
 	}
-	unlock_super(sb);
 
 out:
 	jbd2_journal_unlock_updates(journal);
@@ -3438,15 +3436,8 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
 			    (sbi->s_mount_state & EXT4_VALID_FS))
 				es->s_state = cpu_to_le16(sbi->s_mount_state);
 
-			/*
-			 * We have to unlock super so that we can wait for
-			 * transactions.
-			 */
-			if (sbi->s_journal) {
-				unlock_super(sb);
+			if (sbi->s_journal)
 				ext4_mark_recovery_complete(sb, es);
-				lock_super(sb);
-			}
 		} else {
 			int ret;
 			if ((ret = EXT4_HAS_RO_COMPAT_FEATURE(sb,
-- 
1.5.6.3



* [PATCH 4/5] ext4: Replace lock/unlock_super() with an explicit lock for the orphan list
  2009-04-26  3:49     ` [PATCH 3/5] ext4: ext4_mark_recovery_complete() doesn't need to use lock_super Theodore Ts'o
@ 2009-04-26  3:49       ` Theodore Ts'o
  2009-04-26  3:49         ` [PATCH 5/5] ext4: Replace lock/unlock_super() with an explicit lock for resizing Theodore Ts'o
  2009-04-28 15:52         ` [PATCH 4/5] ext4: Replace lock/unlock_super() with an explicit lock for the orphan list Jan Kara
  2009-04-26  7:07       ` [PATCH 3/5] ext4: ext4_mark_recovery_complete() doesn't need to use lock_super Christoph Hellwig
  1 sibling, 2 replies; 15+ messages in thread
From: Theodore Ts'o @ 2009-04-26  3:49 UTC (permalink / raw)
  To: Ext4 Developers List
  Cc: Linux Kernel Developers List, Christoph Hellwig, Al Viro,
	Theodore Ts'o

Use a separate lock to protect the orphan list, so we can stop
overloading the use of lock_super().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 fs/ext4/ext4_sb.h |    1 +
 fs/ext4/namei.c   |   20 +++++++++++---------
 fs/ext4/super.c   |    1 +
 3 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/fs/ext4/ext4_sb.h b/fs/ext4/ext4_sb.h
index 57b71fe..4bda2f7 100644
--- a/fs/ext4/ext4_sb.h
+++ b/fs/ext4/ext4_sb.h
@@ -71,6 +71,7 @@ struct ext4_sb_info {
 	struct inode *s_journal_inode;
 	struct journal_s *s_journal;
 	struct list_head s_orphan;
+	struct mutex s_orphan_lock;
 	unsigned long s_commit_interval;
 	u32 s_max_batch_time;
 	u32 s_min_batch_time;
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 22098e1..8018e49 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -1997,7 +1997,7 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
 	if (!ext4_handle_valid(handle))
 		return 0;
 
-	lock_super(sb);
+	mutex_lock(&EXT4_SB(sb)->s_orphan_lock);
 	if (!list_empty(&EXT4_I(inode)->i_orphan))
 		goto out_unlock;
 
@@ -2006,9 +2006,13 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
 
 	/* @@@ FIXME: Observation from aviro:
 	 * I think I can trigger J_ASSERT in ext4_orphan_add().  We block
-	 * here (on lock_super()), so race with ext4_link() which might bump
+	 * here (on s_orphan_lock), so race with ext4_link() which might bump
 	 * ->i_nlink. For, say it, character device. Not a regular file,
 	 * not a directory, not a symlink and ->i_nlink > 0.
+	 *
+	 * tytso, 4/25/2009: I'm not sure how that could happen;
+	 * shouldn't the fs core protect us from these sort of
+	 * unlink()/link() races?
 	 */
 	J_ASSERT((S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
 		  S_ISLNK(inode->i_mode)) || inode->i_nlink == 0);
@@ -2045,7 +2049,7 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
 	jbd_debug(4, "orphan inode %lu will point to %d\n",
 			inode->i_ino, NEXT_ORPHAN(inode));
 out_unlock:
-	unlock_super(sb);
+	mutex_unlock(&EXT4_SB(sb)->s_orphan_lock);
 	ext4_std_error(inode->i_sb, err);
 	return err;
 }
@@ -2066,11 +2070,9 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
 	if (!ext4_handle_valid(handle))
 		return 0;
 
-	lock_super(inode->i_sb);
-	if (list_empty(&ei->i_orphan)) {
-		unlock_super(inode->i_sb);
-		return 0;
-	}
+	mutex_lock(&EXT4_SB(inode->i_sb)->s_orphan_lock);
+	if (list_empty(&ei->i_orphan))
+		goto out;
 
 	ino_next = NEXT_ORPHAN(inode);
 	prev = ei->i_orphan.prev;
@@ -2120,7 +2122,7 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
 out_err:
 	ext4_std_error(inode->i_sb, err);
 out:
-	unlock_super(inode->i_sb);
+	mutex_unlock(&EXT4_SB(inode->i_sb)->s_orphan_lock);
 	return err;
 
 out_brelse:
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 176d43f..c23e82c 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2623,6 +2623,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	sb->dq_op = &ext4_quota_operations;
 #endif
 	INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */
+	mutex_init(&sbi->s_orphan_lock);
 
 	sb->s_root = NULL;
 
-- 
1.5.6.3



* [PATCH 5/5] ext4: Replace lock/unlock_super() with an explicit lock for resizing
  2009-04-26  3:49       ` [PATCH 4/5] ext4: Replace lock/unlock_super() with an explicit lock for the orphan list Theodore Ts'o
@ 2009-04-26  3:49         ` Theodore Ts'o
  2009-04-28 16:02           ` Jan Kara
  2009-04-28 15:52         ` [PATCH 4/5] ext4: Replace lock/unlock_super() with an explicit lock for the orphan list Jan Kara
  1 sibling, 1 reply; 15+ messages in thread
From: Theodore Ts'o @ 2009-04-26  3:49 UTC (permalink / raw)
  To: Ext4 Developers List
  Cc: Linux Kernel Developers List, Christoph Hellwig, Al Viro,
	Theodore Ts'o

Use a separate lock to protect s_groups_count and the other block
group descriptors which get changed via an on-line resize operation,
so we can stop overloading the use of lock_super().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 fs/ext4/ext4_sb.h |    1 +
 fs/ext4/resize.c  |   35 ++++++++++++++++++-----------------
 fs/ext4/super.c   |    1 +
 3 files changed, 20 insertions(+), 17 deletions(-)

diff --git a/fs/ext4/ext4_sb.h b/fs/ext4/ext4_sb.h
index 4bda2f7..2d36223 100644
--- a/fs/ext4/ext4_sb.h
+++ b/fs/ext4/ext4_sb.h
@@ -72,6 +72,7 @@ struct ext4_sb_info {
 	struct journal_s *s_journal;
 	struct list_head s_orphan;
 	struct mutex s_orphan_lock;
+	struct mutex s_resize_lock;
 	unsigned long s_commit_interval;
 	u32 s_max_batch_time;
 	u32 s_min_batch_time;
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index 546c7dd..e8ded13 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -193,7 +193,7 @@ static int setup_new_group_blocks(struct super_block *sb,
 	if (IS_ERR(handle))
 		return PTR_ERR(handle);
 
-	lock_super(sb);
+	mutex_lock(&sbi->s_resize_lock);
 	if (input->group != sbi->s_groups_count) {
 		err = -EBUSY;
 		goto exit_journal;
@@ -302,7 +302,7 @@ exit_bh:
 	brelse(bh);
 
 exit_journal:
-	unlock_super(sb);
+	mutex_unlock(&sbi->s_resize_lock);
 	if ((err2 = ext4_journal_stop(handle)) && !err)
 		err = err2;
 
@@ -643,11 +643,12 @@ exit_free:
  * important part is that the new block and inode counts are in the backup
  * superblocks, and the location of the new group metadata in the GDT backups.
  *
- * We do not need lock_super() for this, because these blocks are not
- * otherwise touched by the filesystem code when it is mounted.  We don't
- * need to worry about last changing from sbi->s_groups_count, because the
- * worst that can happen is that we do not copy the full number of backups
- * at this time.  The resize which changed s_groups_count will backup again.
+ * We do not need to take the s_resize_lock for this, because these
+ * blocks are not otherwise touched by the filesystem code when it is
+ * mounted.  We don't need to worry about last changing from
+ * sbi->s_groups_count, because the worst that can happen is that we
+ * do not copy the full number of backups at this time.  The resize
+ * which changed s_groups_count will back up again.
  */
 static void update_backups(struct super_block *sb,
 			   int blk_off, char *data, int size)
@@ -809,7 +810,7 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
 		goto exit_put;
 	}
 
-	lock_super(sb);
+	mutex_lock(&sbi->s_resize_lock);
 	if (input->group != sbi->s_groups_count) {
 		ext4_warning(sb, __func__,
 			     "multiple resizers run on filesystem!");
@@ -840,7 +841,7 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
         /*
          * OK, now we've set up the new group.  Time to make it active.
          *
-         * Current kernels don't lock all allocations via lock_super(),
+         * We do not lock all allocations via the s_resize_lock,
          * so we have to be safe wrt. concurrent accesses the group
          * data.  So we need to be careful to set all of the relevant
          * group descriptor data etc. *before* we enable the group.
@@ -900,12 +901,12 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
 	 *
 	 * The precise rules we use are:
 	 *
-	 * * Writers of s_groups_count *must* hold lock_super
+	 * * Writers of s_groups_count *must* hold s_resize_lock
 	 * AND
 	 * * Writers must perform a smp_wmb() after updating all dependent
 	 *   data and before modifying the groups count
 	 *
-	 * * Readers must hold lock_super() over the access
+	 * * Readers must hold s_resize_lock over the access
 	 * OR
 	 * * Readers must perform an smp_rmb() after reading the groups count
 	 *   and before reading any dependent data.
@@ -948,7 +949,7 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
 	sb->s_dirt = 1;
 
 exit_journal:
-	unlock_super(sb);
+	mutex_unlock(&sbi->s_resize_lock);
 	if ((err2 = ext4_journal_stop(handle)) && !err)
 		err = err2;
 	if (!err) {
@@ -986,7 +987,7 @@ int ext4_group_extend(struct super_block *sb, struct ext4_super_block *es,
 
 	/* We don't need to worry about locking wrt other resizers just
 	 * yet: we're going to revalidate es->s_blocks_count after
-	 * taking lock_super() below. */
+	 * taking the s_resize_lock below. */
 	o_blocks_count = ext4_blocks_count(es);
 	o_groups_count = EXT4_SB(sb)->s_groups_count;
 
@@ -1056,11 +1057,11 @@ int ext4_group_extend(struct super_block *sb, struct ext4_super_block *es,
 		goto exit_put;
 	}
 
-	lock_super(sb);
+	mutex_lock(&EXT4_SB(sb)->s_resize_lock);
 	if (o_blocks_count != ext4_blocks_count(es)) {
 		ext4_warning(sb, __func__,
 			     "multiple resizers run on filesystem!");
-		unlock_super(sb);
+		mutex_unlock(&EXT4_SB(sb)->s_resize_lock);
 		ext4_journal_stop(handle);
 		err = -EBUSY;
 		goto exit_put;
@@ -1070,14 +1071,14 @@ int ext4_group_extend(struct super_block *sb, struct ext4_super_block *es,
 						 EXT4_SB(sb)->s_sbh))) {
 		ext4_warning(sb, __func__,
 			     "error %d on journal write access", err);
-		unlock_super(sb);
+		mutex_unlock(&EXT4_SB(sb)->s_resize_lock);
 		ext4_journal_stop(handle);
 		goto exit_put;
 	}
 	ext4_blocks_count_set(es, o_blocks_count + add);
 	ext4_handle_dirty_metadata(handle, NULL, EXT4_SB(sb)->s_sbh);
 	sb->s_dirt = 1;
-	unlock_super(sb);
+	mutex_unlock(&EXT4_SB(sb)->s_resize_lock);
 	ext4_debug("freeing blocks %llu through %llu\n", o_blocks_count,
 		   o_blocks_count + add);
 	/* We add the blocks to the bitmap and set the group need init bit */
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index c23e82c..b4735e4 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2624,6 +2624,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 #endif
 	INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */
 	mutex_init(&sbi->s_orphan_lock);
+	mutex_init(&sbi->s_resize_lock);
 
 	sb->s_root = NULL;
 
-- 
1.5.6.3



* Re: [PATCH 3/5] ext4: ext4_mark_recovery_complete() doesn't need to use lock_super
  2009-04-26  3:49     ` [PATCH 3/5] ext4: ext4_mark_recovery_complete() doesn't need to use lock_super Theodore Ts'o
  2009-04-26  3:49       ` [PATCH 4/5] ext4: Replace lock/unlock_super() with an explicit lock for the orphan list Theodore Ts'o
@ 2009-04-26  7:07       ` Christoph Hellwig
  2009-04-26 11:46         ` Theodore Tso
  1 sibling, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2009-04-26  7:07 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Ext4 Developers List, Linux Kernel Developers List,
	Christoph Hellwig, Al Viro

On Sat, Apr 25, 2009 at 11:49:23PM -0400, Theodore Ts'o wrote:
> The function ext4_mark_recovery_complete() is called from two call
> paths: either (a) while mounting the filesystem, in which case there's
> no danger of any other CPU calling write_super() until the mount is
> completed, and (b) while remounting the filesystem read-write, in
> which case the fs core has already locked the superblock, and in any
> case write_super() wouldn't be called until the filesystem is
> successfully changed from being mounted read-only to read-write.

Currently ext4_remount releases/reacquires lock_super around
ext4_mark_recovery_complete, and unfortunately currently ->write_super
can be called on an r/o filesystem (that's why we have the MS_RDONLY
checks in all instances, I plan to clean that mess up).

So for now I would keep that instance of lock_super; it'll also serve as
documentation for the locking requirements once we've pushed
lock_super down into the filesystem in ->write_super and ->remount_fs, so
that it can eventually be replaced with an ext4-local lock.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/5] ext4: ext4_mark_recovery_complete() doesn't need to use lock_super
  2009-04-26  7:07       ` [PATCH 3/5] ext4: ext4_mark_recovery_complete() doesn't need to use lock_super Christoph Hellwig
@ 2009-04-26 11:46         ` Theodore Tso
  2009-04-26 11:49           ` Christoph Hellwig
  0 siblings, 1 reply; 15+ messages in thread
From: Theodore Tso @ 2009-04-26 11:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ext4 Developers List, Linux Kernel Developers List, Al Viro

On Sun, Apr 26, 2009 at 03:07:14AM -0400, Christoph Hellwig wrote:
> On Sat, Apr 25, 2009 at 11:49:23PM -0400, Theodore Ts'o wrote:
> > The function ext4_mark_recovery_complete() is called from two call
> > paths: either (a) while mounting the filesystem, in which case there's
> > no danger of any other CPU calling write_super() until the mount is
> > completed, and (b) while remounting the filesystem read-write, in
> > which case the fs core has already locked the superblock, and in any
> > case write_super() wouldn't be called until the filesystem is
> > successfully changed from being mounted read-only to read-write.
> 
> Currently ext4_remount releases/reacquires lock_super around
> ext4_mark_recovery_complete, and unfortunately currently ->write_super
> can be called on an r/o filesystem (that's why we have the MS_RDONLY
> checks in all instances, I plan to clean that mess up).

That's true, but the patch also takes out the release/reacquire in
ext4_remount (which was particularly ugly, belch).  So even if
write_super gets called on an r/o filesystem (why?!?), we should be
safe because remount will hold lock_super() throughout the entire
remount operation.

We could delay this cleanup until you clean up the mess with write_super,
but I don't think there would be any harm in removing the
lock_super()/unlock_super() pair in ext4_mark_recovery_complete(), and
the unlock_super()/lock_super() pair in ext4_remount, before then.  Am
I missing something?

						- Ted

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/5] ext4: ext4_mark_recovery_complete() doesn't need to use lock_super
  2009-04-26 11:46         ` Theodore Tso
@ 2009-04-26 11:49           ` Christoph Hellwig
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2009-04-26 11:49 UTC (permalink / raw)
  To: Theodore Tso, Christoph Hellwig, Ext4 Developers List,
	Linux Kernel Developers List, Al Viro

On Sun, Apr 26, 2009 at 07:46:08AM -0400, Theodore Tso wrote:
> That's true, but the patch also takes out the release/reacquire in in
> ext4_remount (which was particularly ugly, belch).

Sorry, missed the second hunk of the patch.

> So even if
> write_super gets called on an r/o filesystem (why?!?),

No good reason really.  Hopefully we'll sort all that out soon.

> we should be
> safe because remount will hold lock_super() throughout the entire
> remount operation.
> 
> We could delay this cleanup until you clean up the mess with write_super,
> but I don't think there would be any harm in removing the
> lock_super()/unlock_super() pair in ext4_mark_recovery_complete(), and
> the unlock_super()/lock_super() pair in ext4_remount, before then.  Am
> I missing something?

No, I was just missing the second hunk of the patch.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/5] ext4: Replace lock/unlock_super() with an explicit lock for the orphan list
  2009-04-26  3:49       ` [PATCH 4/5] ext4: Replace lock/unlock_super() with an explicit lock for the orphan list Theodore Ts'o
  2009-04-26  3:49         ` [PATCH 5/5] ext4: Replace lock/unlock_super() with an explicit lock for resizing Theodore Ts'o
@ 2009-04-28 15:52         ` Jan Kara
  1 sibling, 0 replies; 15+ messages in thread
From: Jan Kara @ 2009-04-28 15:52 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Ext4 Developers List, Linux Kernel Developers List,
	Christoph Hellwig, Al Viro

  Hi,

> Use a separate lock to protect the orphan list, so we can stop
> overloading the use of lock_super().
  Yes, this was needed for a long time.

> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> ---
>  fs/ext4/ext4_sb.h |    1 +
>  fs/ext4/namei.c   |   20 +++++++++++---------
>  fs/ext4/super.c   |    1 +
>  3 files changed, 13 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/ext4/ext4_sb.h b/fs/ext4/ext4_sb.h
> index 57b71fe..4bda2f7 100644
> --- a/fs/ext4/ext4_sb.h
> +++ b/fs/ext4/ext4_sb.h
> @@ -71,6 +71,7 @@ struct ext4_sb_info {
>  	struct inode *s_journal_inode;
>  	struct journal_s *s_journal;
>  	struct list_head s_orphan;
> +	struct mutex s_orphan_lock;
>  	unsigned long s_commit_interval;
>  	u32 s_max_batch_time;
>  	u32 s_min_batch_time;
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index 22098e1..8018e49 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -1997,7 +1997,7 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
>  	if (!ext4_handle_valid(handle))
>  		return 0;
>  
> -	lock_super(sb);
> +	mutex_lock(&EXT4_SB(sb)->s_orphan_lock);
>  	if (!list_empty(&EXT4_I(inode)->i_orphan))
>  		goto out_unlock;
>  
> @@ -2006,9 +2006,13 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
>  
>  	/* @@@ FIXME: Observation from aviro:
>  	 * I think I can trigger J_ASSERT in ext4_orphan_add().  We block
> -	 * here (on lock_super()), so race with ext4_link() which might bump
> +	 * here (on s_orphan_lock), so race with ext4_link() which might bump
>  	 * ->i_nlink. For, say it, character device. Not a regular file,
>  	 * not a directory, not a symlink and ->i_nlink > 0.
> +	 *
> +	 * tytso, 4/25/2009: I'm not sure how that could happen;
> +	 * shouldn't the fs core protect us from this sort of
> +	 * unlink()/link() races?
>  	 */
  We always call ext4_orphan_add() under the i_mutex of the inode we are
adding (except for the migrate code, well), and hence i_nlink had better
be stable... I'd just remove the comment.

>  	J_ASSERT((S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
>  		  S_ISLNK(inode->i_mode)) || inode->i_nlink == 0);
> @@ -2045,7 +2049,7 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
>  	jbd_debug(4, "orphan inode %lu will point to %d\n",
>  			inode->i_ino, NEXT_ORPHAN(inode));
>  out_unlock:
> -	unlock_super(sb);
> +	mutex_unlock(&EXT4_SB(sb)->s_orphan_lock);
>  	ext4_std_error(inode->i_sb, err);
>  	return err;
>  }
> @@ -2066,11 +2070,9 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
>  	if (!ext4_handle_valid(handle))
>  		return 0;
>  
> -	lock_super(inode->i_sb);
> -	if (list_empty(&ei->i_orphan)) {
> -		unlock_super(inode->i_sb);
> -		return 0;
> -	}
> +	mutex_lock(&EXT4_SB(inode->i_sb)->s_orphan_lock);
> +	if (list_empty(&ei->i_orphan))
> +		goto out;
>  
>  	ino_next = NEXT_ORPHAN(inode);
>  	prev = ei->i_orphan.prev;
> @@ -2120,7 +2122,7 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
>  out_err:
>  	ext4_std_error(inode->i_sb, err);
>  out:
> -	unlock_super(inode->i_sb);
> +	mutex_unlock(&EXT4_SB(inode->i_sb)->s_orphan_lock);
>  	return err;
>  
>  out_brelse:
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 176d43f..c23e82c 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -2623,6 +2623,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>  	sb->dq_op = &ext4_quota_operations;
>  #endif
>  	INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */
> +	mutex_init(&sbi->s_orphan_lock);
>  
>  	sb->s_root = NULL;
  Otherwise the patch looks good.

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs
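The pattern patch 4/5 introduces — one dedicated mutex guarding exactly
one structure, here the orphan list, instead of overloading a big
per-superblock lock — can be sketched in userspace C.  This is an
illustration only: a pthread mutex stands in for the kernel mutex, a toy
singly-linked list stands in for list_head, and orphan_add/orphan_del
are invented names, not the ext4 functions:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct orphan { struct orphan *next; int on_list; };

static pthread_mutex_t orphan_lock = PTHREAD_MUTEX_INITIALIZER;
static struct orphan *orphan_head;

void orphan_add(struct orphan *o)
{
	pthread_mutex_lock(&orphan_lock);
	if (!o->on_list) {	/* mirrors the list_empty() early exit */
		o->next = orphan_head;
		orphan_head = o;
		o->on_list = 1;
	}
	pthread_mutex_unlock(&orphan_lock);
}

void orphan_del(struct orphan *o)
{
	pthread_mutex_lock(&orphan_lock);
	if (o->on_list) {
		struct orphan **pp = &orphan_head;
		while (*pp != o)	/* unlink o wherever it sits */
			pp = &(*pp)->next;
		*pp = o->next;
		o->on_list = 0;
	}
	pthread_mutex_unlock(&orphan_lock);
}
```

Because the mutex protects only this list, it can be held across the
list walk without serializing unrelated superblock operations — which is
the whole point of splitting it out of lock_super().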

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 5/5] ext4: Replace lock/unlock_super() with an explicit lock for resizing
  2009-04-26  3:49         ` [PATCH 5/5] ext4: Replace lock/unlock_super() with an explicit lock for resizing Theodore Ts'o
@ 2009-04-28 16:02           ` Jan Kara
  0 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2009-04-28 16:02 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Ext4 Developers List, Linux Kernel Developers List,
	Christoph Hellwig, Al Viro

> Use a separate lock to protect s_groups_count and the other block
> group descriptors which get changed via an on-line resize operation,
> so we can stop overloading the use of lock_super().
  This patch looks good as well. You can add
Acked-by: Jan Kara <jack@suse.cz>
> 
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> ---
>  fs/ext4/ext4_sb.h |    1 +
>  fs/ext4/resize.c  |   35 ++++++++++++++++++-----------------
>  fs/ext4/super.c   |    1 +
>  3 files changed, 20 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/ext4/ext4_sb.h b/fs/ext4/ext4_sb.h
> index 4bda2f7..2d36223 100644
> --- a/fs/ext4/ext4_sb.h
> +++ b/fs/ext4/ext4_sb.h
> @@ -72,6 +72,7 @@ struct ext4_sb_info {
>  	struct journal_s *s_journal;
>  	struct list_head s_orphan;
>  	struct mutex s_orphan_lock;
> +	struct mutex s_resize_lock;
>  	unsigned long s_commit_interval;
>  	u32 s_max_batch_time;
>  	u32 s_min_batch_time;
> diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
> index 546c7dd..e8ded13 100644
> --- a/fs/ext4/resize.c
> +++ b/fs/ext4/resize.c
> @@ -193,7 +193,7 @@ static int setup_new_group_blocks(struct super_block *sb,
>  	if (IS_ERR(handle))
>  		return PTR_ERR(handle);
>  
> -	lock_super(sb);
> +	mutex_lock(&sbi->s_resize_lock);
>  	if (input->group != sbi->s_groups_count) {
>  		err = -EBUSY;
>  		goto exit_journal;
> @@ -302,7 +302,7 @@ exit_bh:
>  	brelse(bh);
>  
>  exit_journal:
> -	unlock_super(sb);
> +	mutex_unlock(&sbi->s_resize_lock);
>  	if ((err2 = ext4_journal_stop(handle)) && !err)
>  		err = err2;
>  
> @@ -643,11 +643,12 @@ exit_free:
>   * important part is that the new block and inode counts are in the backup
>   * superblocks, and the location of the new group metadata in the GDT backups.
>   *
> - * We do not need lock_super() for this, because these blocks are not
> - * otherwise touched by the filesystem code when it is mounted.  We don't
> - * need to worry about last changing from sbi->s_groups_count, because the
> - * worst that can happen is that we do not copy the full number of backups
> - * at this time.  The resize which changed s_groups_count will backup again.
> + * We do not need to take the s_resize_lock for this, because these
> + * blocks are not otherwise touched by the filesystem code when it is
> + * mounted.  We don't need to worry about last changing from
> + * sbi->s_groups_count, because the worst that can happen is that we
> + * do not copy the full number of backups at this time.  The resize
> + * which changed s_groups_count will backup again.
>   */
>  static void update_backups(struct super_block *sb,
>  			   int blk_off, char *data, int size)
> @@ -809,7 +810,7 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
>  		goto exit_put;
>  	}
>  
> -	lock_super(sb);
> +	mutex_lock(&sbi->s_resize_lock);
>  	if (input->group != sbi->s_groups_count) {
>  		ext4_warning(sb, __func__,
>  			     "multiple resizers run on filesystem!");
> @@ -840,7 +841,7 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
>          /*
>           * OK, now we've set up the new group.  Time to make it active.
>           *
> -         * Current kernels don't lock all allocations via lock_super(),
> +         * We do not lock all allocations via s_resize_lock,
>           * so we have to be safe wrt. concurrent accesses the group
>           * data.  So we need to be careful to set all of the relevant
>           * group descriptor data etc. *before* we enable the group.
> @@ -900,12 +901,12 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
>  	 *
>  	 * The precise rules we use are:
>  	 *
> -	 * * Writers of s_groups_count *must* hold lock_super
> +	 * * Writers of s_groups_count *must* hold s_resize_lock
>  	 * AND
>  	 * * Writers must perform a smp_wmb() after updating all dependent
>  	 *   data and before modifying the groups count
>  	 *
> -	 * * Readers must hold lock_super() over the access
> +	 * * Readers must hold s_resize_lock over the access
>  	 * OR
>  	 * * Readers must perform an smp_rmb() after reading the groups count
>  	 *   and before reading any dependent data.
> @@ -948,7 +949,7 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
>  	sb->s_dirt = 1;
>  
>  exit_journal:
> -	unlock_super(sb);
> +	mutex_unlock(&sbi->s_resize_lock);
>  	if ((err2 = ext4_journal_stop(handle)) && !err)
>  		err = err2;
>  	if (!err) {
> @@ -986,7 +987,7 @@ int ext4_group_extend(struct super_block *sb, struct ext4_super_block *es,
>  
>  	/* We don't need to worry about locking wrt other resizers just
>  	 * yet: we're going to revalidate es->s_blocks_count after
> -	 * taking lock_super() below. */
> +	 * taking the s_resize_lock below. */
>  	o_blocks_count = ext4_blocks_count(es);
>  	o_groups_count = EXT4_SB(sb)->s_groups_count;
>  
> @@ -1056,11 +1057,11 @@ int ext4_group_extend(struct super_block *sb, struct ext4_super_block *es,
>  		goto exit_put;
>  	}
>  
> -	lock_super(sb);
> +	mutex_lock(&EXT4_SB(sb)->s_resize_lock);
>  	if (o_blocks_count != ext4_blocks_count(es)) {
>  		ext4_warning(sb, __func__,
>  			     "multiple resizers run on filesystem!");
> -		unlock_super(sb);
> +		mutex_unlock(&EXT4_SB(sb)->s_resize_lock);
>  		ext4_journal_stop(handle);
>  		err = -EBUSY;
>  		goto exit_put;
> @@ -1070,14 +1071,14 @@ int ext4_group_extend(struct super_block *sb, struct ext4_super_block *es,
>  						 EXT4_SB(sb)->s_sbh))) {
>  		ext4_warning(sb, __func__,
>  			     "error %d on journal write access", err);
> -		unlock_super(sb);
> +		mutex_unlock(&EXT4_SB(sb)->s_resize_lock);
>  		ext4_journal_stop(handle);
>  		goto exit_put;
>  	}
>  	ext4_blocks_count_set(es, o_blocks_count + add);
>  	ext4_handle_dirty_metadata(handle, NULL, EXT4_SB(sb)->s_sbh);
>  	sb->s_dirt = 1;
> -	unlock_super(sb);
> +	mutex_unlock(&EXT4_SB(sb)->s_resize_lock);
>  	ext4_debug("freeing blocks %llu through %llu\n", o_blocks_count,
>  		   o_blocks_count + add);
>  	/* We add the blocks to the bitmap and set the group need init bit */
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index c23e82c..b4735e4 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -2624,6 +2624,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>  #endif
>  	INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */
>  	mutex_init(&sbi->s_orphan_lock);
> +	mutex_init(&sbi->s_resize_lock);
>  
>  	sb->s_root = NULL;
>  
> -- 
> 1.5.6.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/5] ext4: Remove outdated comment about lock_super()
  2009-04-26  3:49   ` [PATCH 2/5] ext4: Remove outdated comment about lock_super() Theodore Ts'o
  2009-04-26  3:49     ` [PATCH 3/5] ext4: ext4_mark_recovery_complete() doesn't need to use lock_super Theodore Ts'o
@ 2009-04-28 16:13     ` Jan Kara
  1 sibling, 0 replies; 15+ messages in thread
From: Jan Kara @ 2009-04-28 16:13 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Ext4 Developers List, Linux Kernel Developers List, Al Viro,
	Christoph Hellwig

> ext4_fill_super() is no longer called by read_super(), and it is no
> longer called with the superblock locked.  The
> unlock_super()/lock_super() is no longer present, so this comment is
> entirely superfluous.
  Yes, the comment seems to be out of date.

Acked-by: Jan Kara <jack@suse.cz>

									Honza
> 
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> ---
>  fs/ext4/super.c |    8 --------
>  1 files changed, 0 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index e20ff9c..e0b0c9f 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -2806,14 +2806,6 @@ no_journal:
>  		goto failed_mount4;
>  	};
>  
> -	/*
> -	 * akpm: core read_super() calls in here with the superblock locked.
> -	 * That deadlocks, because orphan cleanup needs to lock the superblock
> -	 * in numerous places.  Here we just pop the lock - it's relatively
> -	 * harmless, because we are now ready to accept write_super() requests,
> -	 * and aviro says that's the only reason for hanging onto the
> -	 * superblock lock.
> -	 */
>  	EXT4_SB(sb)->s_mount_state |= EXT4_ORPHAN_FS;
>  	ext4_orphan_cleanup(sb, es);
>  	EXT4_SB(sb)->s_mount_state &= ~EXT4_ORPHAN_FS;
> -- 
> 1.5.6.3
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/5] ext4: Avoid races caused by on-line resizing and SMP memory reordering
  2009-04-26  3:49 ` [PATCH 1/5] ext4: Avoid races caused by on-line resizing and SMP memory reordering Theodore Ts'o
  2009-04-26  3:49   ` [PATCH 2/5] ext4: Remove outdated comment about lock_super() Theodore Ts'o
@ 2009-04-28 16:23   ` Jan Kara
  2009-04-28 17:14     ` Theodore Tso
  1 sibling, 1 reply; 15+ messages in thread
From: Jan Kara @ 2009-04-28 16:23 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Ext4 Developers List, Linux Kernel Developers List,
	Christoph Hellwig, Al Viro

> Ext4's on-line resizing adds a new block group and then, only at the
> last step, adjusts s_groups_count.  However, it's possible on SMP
> systems that another CPU could see the updated s_groups_count and
> not see the newly initialized data structures for the just-added block
> group.  For this reason, it's important to insert an SMP read barrier
> after reading s_groups_count and before reading any, say, block group
> descriptors allowed by the block group count.
> 
> Unfortunately, we rather blatantly violate this locking protocol as
> documented in fs/ext4/resize.c.  Fortunately, (1) on-line resizes
> happen relatively rarely, and (2) it seems rare that the filesystem
> code will immediately try to use a just-added block group before any
> memory ordering issues resolve themselves.  So apparently problems
> here are relatively hard to hit, since ext3 is also vulnerable to this
> race and no one has apparently complained.
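The ordering rule described above — the writer initializes the new
group's data, issues a write barrier, and only then publishes the new
group count, while the reader loads the count, issues a read barrier,
and only then touches groups below that count — can be sketched with
C11 atomics in userspace.  atomic_thread_fence() stands in for the
kernel's smp_wmb()/smp_rmb(); MAX_GROUPS, add_group() and
last_group_data() are invented for the example:

```c
#include <assert.h>
#include <stdatomic.h>

#define MAX_GROUPS 8

static int group_data[MAX_GROUPS];	/* per-group payload */
static atomic_uint groups_count = 2;	/* published group count */

void add_group(int data)		/* writer side (resizer) */
{
	unsigned int n = atomic_load_explicit(&groups_count,
					      memory_order_relaxed);
	group_data[n] = data;		/* initialize the new group */
	atomic_thread_fence(memory_order_release);	/* smp_wmb() */
	atomic_store_explicit(&groups_count, n + 1,
			      memory_order_relaxed);	/* publish */
}

int last_group_data(void)		/* reader side (allocator) */
{
	unsigned int n = atomic_load_explicit(&groups_count,
					      memory_order_relaxed);
	atomic_thread_fence(memory_order_acquire);	/* smp_rmb() */
	return group_data[n - 1];	/* n was fenced, data is visible */
}
```

The fences pair up: any reader that observes the new count after its
acquire fence is guaranteed to also observe the group data written
before the writer's release fence — which is exactly the invariant the
missing smp_rmb() calls were breaking.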
  Ouch... Hmm, smp_rmb() isn't completely free, and mainly it's a bit
ugly and prone to errors (I'm afraid that the next time someone changes
the allocation code, we'll miss some barriers again)... so... Maybe a
stupid idea, but wouldn't it be easier to solve the online resize like
this: freeze the filesystem, do all the changes required for the extend,
unfreeze the filesystem?

								Honza
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> ---
>  fs/ext4/balloc.c  |    6 ++++--
>  fs/ext4/ialloc.c  |   34 +++++++++++++++++++++-------------
>  fs/ext4/mballoc.c |   49 ++++++++++++++++++++++++++++++++-----------------
>  3 files changed, 57 insertions(+), 32 deletions(-)
> 
> diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
> index 53c72ad..d1615f2 100644
> --- a/fs/ext4/balloc.c
> +++ b/fs/ext4/balloc.c
> @@ -88,9 +88,11 @@ unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
>  		 ext4_group_t block_group, struct ext4_group_desc *gdp)
>  {
>  	int bit, bit_max;
> +	ext4_group_t ngroups = EXT4_SB(sb)->s_groups_count;
>  	unsigned free_blocks, group_blocks;
>  	struct ext4_sb_info *sbi = EXT4_SB(sb);
>  
> +	smp_rmb();		/* after reading s_groups_count first */
>  	if (bh) {
>  		J_ASSERT_BH(bh, buffer_locked(bh));
>  
> @@ -123,7 +125,7 @@ unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
>  		bit_max += ext4_bg_num_gdb(sb, block_group);
>  	}
>  
> -	if (block_group == sbi->s_groups_count - 1) {
> +	if (block_group == ngroups - 1) {
>  		/*
>  		 * Even though mke2fs always initialize first and last group
>  		 * if some other tool enabled the EXT4_BG_BLOCK_UNINIT we need
> @@ -131,7 +133,7 @@ unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
>  		 */
>  		group_blocks = ext4_blocks_count(sbi->s_es) -
>  			le32_to_cpu(sbi->s_es->s_first_data_block) -
> -			(EXT4_BLOCKS_PER_GROUP(sb) * (sbi->s_groups_count - 1));
> +			(EXT4_BLOCKS_PER_GROUP(sb) * (ngroups - 1));
>  	} else {
>  		group_blocks = EXT4_BLOCKS_PER_GROUP(sb);
>  	}
> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index f18e0a0..52ce274 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -322,6 +322,7 @@ static int find_group_dir(struct super_block *sb, struct inode *parent,
>  	ext4_group_t group;
>  	int ret = -1;
>  
> +	smp_rmb();		/* after reading s_groups_count first */
>  	freei = percpu_counter_read_positive(&EXT4_SB(sb)->s_freeinodes_counter);
>  	avefreei = freei / ngroups;
>  
> @@ -362,7 +363,8 @@ static int find_group_flex(struct super_block *sb, struct inode *parent,
>  	ext4_group_t n_fbg_groups;
>  	ext4_group_t i;
>  
> -	n_fbg_groups = (sbi->s_groups_count + flex_size - 1) >>
> +	smp_rmb();		/* after reading s_groups_count first */
> +	n_fbg_groups = (ngroups + flex_size - 1) >>
>  		sbi->s_log_groups_per_flex;
>  
>  find_close_to_parent:
> @@ -478,18 +480,18 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent,
>  {
>  	ext4_group_t parent_group = EXT4_I(parent)->i_block_group;
>  	struct ext4_sb_info *sbi = EXT4_SB(sb);
> -	ext4_group_t ngroups = sbi->s_groups_count;
>  	int inodes_per_group = EXT4_INODES_PER_GROUP(sb);
>  	unsigned int freei, avefreei;
>  	ext4_fsblk_t freeb, avefreeb;
>  	unsigned int ndirs;
>  	int max_dirs, min_inodes;
>  	ext4_grpblk_t min_blocks;
> -	ext4_group_t i, grp, g;
> +	ext4_group_t i, grp, g, ngroups = sbi->s_groups_count;
>  	struct ext4_group_desc *desc;
>  	struct orlov_stats stats;
>  	int flex_size = ext4_flex_bg_size(sbi);
>  
> +	smp_rmb();		/* after reading s_groups_count first */
>  	if (flex_size > 1) {
>  		ngroups = (ngroups + flex_size - 1) >>
>  			sbi->s_log_groups_per_flex;
> @@ -585,6 +587,7 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent,
>  fallback:
>  	ngroups = sbi->s_groups_count;
>  	avefreei = freei / ngroups;
> +	smp_rmb();
>  fallback_retry:
>  	parent_group = EXT4_I(parent)->i_block_group;
>  	for (i = 0; i < ngroups; i++) {
> @@ -613,11 +616,11 @@ static int find_group_other(struct super_block *sb, struct inode *parent,
>  			    ext4_group_t *group, int mode)
>  {
>  	ext4_group_t parent_group = EXT4_I(parent)->i_block_group;
> -	ext4_group_t ngroups = EXT4_SB(sb)->s_groups_count;
> +	ext4_group_t i, last, ngroups = EXT4_SB(sb)->s_groups_count;
>  	struct ext4_group_desc *desc;
> -	ext4_group_t i, last;
>  	int flex_size = ext4_flex_bg_size(EXT4_SB(sb));
>  
> +	smp_rmb();		/* after reading s_groups_count first */
>  	/*
>  	 * Try to place the inode is the same flex group as its
>  	 * parent.  If we can't find space, use the Orlov algorithm to
> @@ -799,7 +802,7 @@ struct inode *ext4_new_inode(handle_t *handle, struct inode *dir, int mode)
>  	struct super_block *sb;
>  	struct buffer_head *inode_bitmap_bh = NULL;
>  	struct buffer_head *group_desc_bh;
> -	ext4_group_t group = 0;
> +	ext4_group_t ngroups, group = 0;
>  	unsigned long ino = 0;
>  	struct inode *inode;
>  	struct ext4_group_desc *gdp = NULL;
> @@ -851,12 +854,14 @@ struct inode *ext4_new_inode(handle_t *handle, struct inode *dir, int mode)
>  		ret2 = find_group_other(sb, dir, &group, mode);
>  
>  got_group:
> +	ngroups = sbi->s_groups_count;
> +	smp_rmb();
>  	EXT4_I(dir)->i_last_alloc_group = group;
>  	err = -ENOSPC;
>  	if (ret2 == -1)
>  		goto out;
>  
> -	for (i = 0; i < sbi->s_groups_count; i++) {
> +	for (i = 0; i < ngroups; i++) {
>  		err = -EIO;
>  
>  		gdp = ext4_get_group_desc(sb, group, &group_desc_bh);
> @@ -917,7 +922,7 @@ repeat_in_this_group:
>  		 * group descriptor metadata has not yet been updated.
>  		 * So we just go onto the next blockgroup.
>  		 */
> -		if (++group == sbi->s_groups_count)
> +		if (++group == ngroups)
>  			group = 0;
>  	}
>  	err = -ENOSPC;
> @@ -1158,17 +1163,18 @@ unsigned long ext4_count_free_inodes(struct super_block *sb)
>  {
>  	unsigned long desc_count;
>  	struct ext4_group_desc *gdp;
> -	ext4_group_t i;
> +	ext4_group_t i, ngroups = EXT4_SB(sb)->s_groups_count;
>  #ifdef EXT4FS_DEBUG
>  	struct ext4_super_block *es;
>  	unsigned long bitmap_count, x;
>  	struct buffer_head *bitmap_bh = NULL;
>  
> +	smp_rmb();		/* after reading s_groups_count first */
>  	es = EXT4_SB(sb)->s_es;
>  	desc_count = 0;
>  	bitmap_count = 0;
>  	gdp = NULL;
> -	for (i = 0; i < EXT4_SB(sb)->s_groups_count; i++) {
> +	for (i = 0; i < ngroups; i++) {
>  		gdp = ext4_get_group_desc(sb, i, NULL);
>  		if (!gdp)
>  			continue;
> @@ -1189,8 +1195,9 @@ unsigned long ext4_count_free_inodes(struct super_block *sb)
>  	       le32_to_cpu(es->s_free_inodes_count), desc_count, bitmap_count);
>  	return desc_count;
>  #else
> +	smp_rmb();		/* after reading s_groups_count first */
>  	desc_count = 0;
> -	for (i = 0; i < EXT4_SB(sb)->s_groups_count; i++) {
> +	for (i = 0; i < ngroups; i++) {
>  		gdp = ext4_get_group_desc(sb, i, NULL);
>  		if (!gdp)
>  			continue;
> @@ -1205,9 +1212,10 @@ unsigned long ext4_count_free_inodes(struct super_block *sb)
>  unsigned long ext4_count_dirs(struct super_block * sb)
>  {
>  	unsigned long count = 0;
> -	ext4_group_t i;
> +	ext4_group_t i, ngroups = EXT4_SB(sb)->s_groups_count;
>  
> -	for (i = 0; i < EXT4_SB(sb)->s_groups_count; i++) {
> +	smp_rmb();		/* after reading s_groups_count first */
> +	for (i = 0; i < ngroups; i++) {
>  		struct ext4_group_desc *gdp = ext4_get_group_desc(sb, i, NULL);
>  		if (!gdp)
>  			continue;
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index f871677..ecc2d49 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -739,6 +739,7 @@ static void ext4_mb_generate_buddy(struct super_block *sb,
>  
>  static int ext4_mb_init_cache(struct page *page, char *incore)
>  {
> +	ext4_group_t ngroups;
>  	int blocksize;
>  	int blocks_per_page;
>  	int groups_per_page;
> @@ -757,6 +758,8 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
>  
>  	inode = page->mapping->host;
>  	sb = inode->i_sb;
> +	ngroups = EXT4_SB(sb)->s_groups_count;
> +	smp_rmb();
>  	blocksize = 1 << inode->i_blkbits;
>  	blocks_per_page = PAGE_CACHE_SIZE / blocksize;
>  
> @@ -780,7 +783,7 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
>  	for (i = 0; i < groups_per_page; i++) {
>  		struct ext4_group_desc *desc;
>  
> -		if (first_group + i >= EXT4_SB(sb)->s_groups_count)
> +		if (first_group + i >= ngroups)
>  			break;
>  
>  		err = -EIO;
> @@ -852,7 +855,7 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
>  		struct ext4_group_info *grinfo;
>  
>  		group = (first_block + i) >> 1;
> -		if (group >= EXT4_SB(sb)->s_groups_count)
> +		if (group >= ngroups)
>  			break;
>  
>  		/*
> @@ -1788,9 +1791,11 @@ int ext4_mb_get_buddy_cache_lock(struct super_block *sb, ext4_group_t group)
>  	int block, pnum;
>  	int blocks_per_page;
>  	int groups_per_page;
> +	ext4_group_t ngroups = EXT4_SB(sb)->s_groups_count;
>  	ext4_group_t first_group;
>  	struct ext4_group_info *grp;
>  
> +	smp_rmb();		/* after reading s_groups_count first */
>  	blocks_per_page = PAGE_CACHE_SIZE / sb->s_blocksize;
>  	/*
>  	 * the buddy cache inode stores the block bitmap
> @@ -1807,7 +1812,7 @@ int ext4_mb_get_buddy_cache_lock(struct super_block *sb, ext4_group_t group)
>  	/* read all groups the page covers into the cache */
>  	for (i = 0; i < groups_per_page; i++) {
>  
> -		if ((first_group + i) >= EXT4_SB(sb)->s_groups_count)
> +		if ((first_group + i) >= ngroups)
>  			break;
>  		grp = ext4_get_group_info(sb, first_group + i);
>  		/* take all groups write allocation
> @@ -1945,8 +1950,7 @@ err:
>  static noinline_for_stack int
>  ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>  {
> -	ext4_group_t group;
> -	ext4_group_t i;
> +	ext4_group_t ngroups, group, i;
>  	int cr;
>  	int err = 0;
>  	int bsbits;
> @@ -1957,6 +1961,8 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>  
>  	sb = ac->ac_sb;
>  	sbi = EXT4_SB(sb);
> +	ngroups = EXT4_SB(sb)->s_groups_count;
> +	smp_rmb();
>  	BUG_ON(ac->ac_status == AC_STATUS_FOUND);
>  
>  	/* first, try the goal */
> @@ -2017,11 +2023,11 @@ repeat:
>  		 */
>  		group = ac->ac_g_ex.fe_group;
>  
> -		for (i = 0; i < EXT4_SB(sb)->s_groups_count; group++, i++) {
> +		for (i = 0; i < ngroups; group++, i++) {
>  			struct ext4_group_info *grp;
>  			struct ext4_group_desc *desc;
>  
> -			if (group == EXT4_SB(sb)->s_groups_count)
> +			if (group == ngroups)
>  				group = 0;
>  
>  			/* quick check to skip empty groups */
> @@ -2320,7 +2326,7 @@ static void *ext4_mb_seq_groups_start(struct seq_file *seq, loff_t *pos)
>  
>  	if (*pos < 0 || *pos >= sbi->s_groups_count)
>  		return NULL;
> -
> +	smp_rmb();
>  	group = *pos + 1;
>  	return (void *) ((unsigned long) group);
>  }
> @@ -2334,6 +2340,7 @@ static void *ext4_mb_seq_groups_next(struct seq_file *seq, void *v, loff_t *pos)
>  	++*pos;
>  	if (*pos < 0 || *pos >= sbi->s_groups_count)
>  		return NULL;
> +	smp_rmb();
>  	group = *pos + 1;
>  	return (void *) ((unsigned long) group);
>  }
> @@ -2587,6 +2594,7 @@ void ext4_mb_update_group_info(struct ext4_group_info *grp, ext4_grpblk_t add)
>  
>  static int ext4_mb_init_backend(struct super_block *sb)
>  {
> +	ext4_group_t ngroups = EXT4_SB(sb)->s_groups_count;
>  	ext4_group_t i;
>  	int metalen;
>  	struct ext4_sb_info *sbi = EXT4_SB(sb);
> @@ -2597,8 +2605,10 @@ static int ext4_mb_init_backend(struct super_block *sb)
>  	struct ext4_group_info **meta_group_info;
>  	struct ext4_group_desc *desc;
>  
> +	smp_rmb();		/* after reading s_groups_count first */
> +
>  	/* This is the number of blocks used by GDT */
> -	num_meta_group_infos = (sbi->s_groups_count + EXT4_DESC_PER_BLOCK(sb) -
> +	num_meta_group_infos = (ngroups + EXT4_DESC_PER_BLOCK(sb) -
>  				1) >> EXT4_DESC_PER_BLOCK_BITS(sb);
>  
>  	/*
> @@ -2644,7 +2654,7 @@ static int ext4_mb_init_backend(struct super_block *sb)
>  	for (i = 0; i < num_meta_group_infos; i++) {
>  		if ((i + 1) == num_meta_group_infos)
>  			metalen = sizeof(*meta_group_info) *
> -				(sbi->s_groups_count -
> +				(ngroups -
>  					(i << EXT4_DESC_PER_BLOCK_BITS(sb)));
>  		meta_group_info = kmalloc(metalen, GFP_KERNEL);
>  		if (meta_group_info == NULL) {
> @@ -2655,7 +2665,7 @@ static int ext4_mb_init_backend(struct super_block *sb)
>  		sbi->s_group_info[i] = meta_group_info;
>  	}
>  
> -	for (i = 0; i < sbi->s_groups_count; i++) {
> +	for (i = 0; i < ngroups; i++) {
>  		desc = ext4_get_group_desc(sb, i, NULL);
>  		if (desc == NULL) {
>  			printk(KERN_ERR
> @@ -2781,13 +2791,15 @@ static void ext4_mb_cleanup_pa(struct ext4_group_info *grp)
>  
>  int ext4_mb_release(struct super_block *sb)
>  {
> +	ext4_group_t ngroups = EXT4_SB(sb)->s_groups_count;
>  	ext4_group_t i;
>  	int num_meta_group_infos;
>  	struct ext4_group_info *grinfo;
>  	struct ext4_sb_info *sbi = EXT4_SB(sb);
>  
> +	smp_rmb();		/* after reading s_groups_count first */
>  	if (sbi->s_group_info) {
> -		for (i = 0; i < sbi->s_groups_count; i++) {
> +		for (i = 0; i < ngroups; i++) {
>  			grinfo = ext4_get_group_info(sb, i);
>  #ifdef DOUBLE_CHECK
>  			kfree(grinfo->bb_bitmap);
> @@ -2797,7 +2809,7 @@ int ext4_mb_release(struct super_block *sb)
>  			ext4_unlock_group(sb, i);
>  			kfree(grinfo);
>  		}
> -		num_meta_group_infos = (sbi->s_groups_count +
> +		num_meta_group_infos = (ngroups +
>  				EXT4_DESC_PER_BLOCK(sb) - 1) >>
>  			EXT4_DESC_PER_BLOCK_BITS(sb);
>  		for (i = 0; i < num_meta_group_infos; i++)
> @@ -4121,7 +4133,7 @@ static void ext4_mb_return_to_preallocation(struct inode *inode,
>  static void ext4_mb_show_ac(struct ext4_allocation_context *ac)
>  {
>  	struct super_block *sb = ac->ac_sb;
> -	ext4_group_t i;
> +	ext4_group_t ngroups, i;
>  
>  	printk(KERN_ERR "EXT4-fs: Can't allocate:"
>  			" Allocation context details:\n");
> @@ -4145,7 +4157,9 @@ static void ext4_mb_show_ac(struct ext4_allocation_context *ac)
>  	printk(KERN_ERR "EXT4-fs: %lu scanned, %d found\n", ac->ac_ex_scanned,
>  		ac->ac_found);
>  	printk(KERN_ERR "EXT4-fs: groups: \n");
> -	for (i = 0; i < EXT4_SB(sb)->s_groups_count; i++) {
> +	ngroups = EXT4_SB(ac->ac_sb)->s_groups_count;
> +	smp_rmb();
> +	for (i = 0; i < ngroups; i++) {
>  		struct ext4_group_info *grp = ext4_get_group_info(sb, i);
>  		struct ext4_prealloc_space *pa;
>  		ext4_grpblk_t start;
> @@ -4469,13 +4483,14 @@ static int ext4_mb_release_context(struct ext4_allocation_context *ac)
>  
>  static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
>  {
> -	ext4_group_t i;
> +	ext4_group_t i, ngroups = EXT4_SB(sb)->s_groups_count;
>  	int ret;
>  	int freed = 0;
>  
> +	smp_rmb();		/* after reading s_groups_count first */
>  	trace_mark(ext4_mb_discard_preallocations, "dev %s needed %d",
>  		   sb->s_id, needed);
> -	for (i = 0; i < EXT4_SB(sb)->s_groups_count && needed > 0; i++) {
> +	for (i = 0; i < ngroups && needed > 0; i++) {
>  		ret = ext4_mb_discard_group_preallocations(sb, i, needed);
>  		freed += ret;
>  		needed -= ret;
> -- 
> 1.5.6.3
> 
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/5] ext4: Avoid races caused by on-line resizing and SMP memory reordering
  2009-04-28 16:23   ` [PATCH 1/5] ext4: Avoid races caused by on-line resizing and SMP memory reordering Jan Kara
@ 2009-04-28 17:14     ` Theodore Tso
  2009-04-29  9:28       ` Jan Kara
  0 siblings, 1 reply; 15+ messages in thread
From: Theodore Tso @ 2009-04-28 17:14 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ext4 Developers List, Linux Kernel Developers List,
	Christoph Hellwig, Al Viro

On Tue, Apr 28, 2009 at 06:23:59PM +0200, Jan Kara wrote:
>   Ouch... Hmm, smp_rmb() isn't completely free and mainly it's a bit
> ugly and prone to errors (I'm afraid next time someone changes the
> allocation code, we miss some barriers again)... so.. Maybe a stupid
> idea but wouldn't it be easier to solve the online resize like: freeze
> the filesystem, do all the changes required for extend, unfreeze the
> filesystem?

Eric suggested a helper function when reading from s_groups_count.
That would take care of the code maintenance headache.  One problem
with using freeze/thaw is it won't work without a journal, and we do
support filesystems without journals in ext4.  (Probably the best
solution for netbooks with crapola SSD's.)

As far as smp_rmb() not being free, it's essentially free for
x86/x86_64 platforms.  Is it really that costly on other
architectures?

						- Ted

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/5] ext4: Avoid races caused by on-line resizing and SMP memory reordering
  2009-04-28 17:14     ` Theodore Tso
@ 2009-04-29  9:28       ` Jan Kara
  0 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2009-04-29  9:28 UTC (permalink / raw)
  To: Theodore Tso, Ext4 Developers List, Linux Kernel Developers List,
	Christoph Hellwig, Al Viro

On Tue 28-04-09 13:14:45, Theodore Tso wrote:
> On Tue, Apr 28, 2009 at 06:23:59PM +0200, Jan Kara wrote:
> >   Ouch... Hmm, smp_rmb() isn't completely free and mainly it's a bit
> > ugly and prone to errors (I'm afraid next time someone changes the
> > allocation code, we miss some barriers again)... so.. Maybe a stupid
> > idea but wouldn't it be easier to solve the online resize like: freeze
> > the filesystem, do all the changes required for extend, unfreeze the
> > filesystem?
> 
> Eric suggested a helper function when reading from s_groups_count.
> That would take care of the code maintenance headache.  One problem
> with using freeze/thaw is it won't work without a journal, and we do
> support filesystems without journals in ext4.  (Probably the best
> solution for netbooks with crapola SSD's.)
  Well, in non-journalling case, we could introduce a rw semaphore
(read acquired / released in journal_start / journal_stop, write acquired
when the fs is frozen). This might be useful for other rare cases where
freezing the fs would be beneficial. But yes, if wrapping into a helper
function works then that might be the easiest way to go.

> As far as smp_rmb() not being free, it's essentially free for
> x86/x86_64 platforms.  Is it really that costly on other
> architectures?
  I had a feeling that it's not that expensive but not quite free either on
x86/x86_64 (I know even less about other archs) - it has to lock the bus,
writes in local CPU caches have to be flushed, no? Probably we don't care
given the size of the functions doing allocation... As an exercise I was
trying to google some numbers but was not really successful, just some
comments about tens of cycles in some emails...

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2009-04-29  9:28 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
2009-04-26  3:49 [PATCH RFC 0/5] Eliminate most lock_super() calls from ext4 Theodore Ts'o
2009-04-26  3:49 ` [PATCH 1/5] ext4: Avoid races caused by on-line resizing and SMP memory reordering Theodore Ts'o
2009-04-26  3:49   ` [PATCH 2/5] ext4: Remove outdated comment about lock_super() Theodore Ts'o
2009-04-26  3:49     ` [PATCH 3/5] ext4: ext4_mark_recovery_complete() doesn't need to use lock_super Theodore Ts'o
2009-04-26  3:49       ` [PATCH 4/5] ext4: Replace lock/unlock_super() with an explicit lock for the orphan list Theodore Ts'o
2009-04-26  3:49         ` [PATCH 5/5] ext4: Replace lock/unlock_super() with an explicit lock for resizing Theodore Ts'o
2009-04-28 16:02           ` Jan Kara
2009-04-28 15:52         ` [PATCH 4/5] ext4: Replace lock/unlock_super() with an explicit lock for the orphan list Jan Kara
2009-04-26  7:07       ` [PATCH 3/5] ext4: ext4_mark_recovery_complete() doesn't need to use lock_super Christoph Hellwig
2009-04-26 11:46         ` Theodore Tso
2009-04-26 11:49           ` Christoph Hellwig
2009-04-28 16:13     ` [PATCH 2/5] ext4: Remove outdated comment about lock_super() Jan Kara
2009-04-28 16:23   ` [PATCH 1/5] ext4: Avoid races caused by on-line resizing and SMP memory reordering Jan Kara
2009-04-28 17:14     ` Theodore Tso
2009-04-29  9:28       ` Jan Kara
