linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCHSET 0/4] ore: Kernel 3.3 BUG squashing (Also for 3.2 Stable@)
@ 2012-01-06 14:37 Boaz Harrosh
  2012-01-06 14:40 ` [PATCH 1/4] ore: FIX breakage when MISC_FILESYSTEMS is not set Boaz Harrosh
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Boaz Harrosh @ 2012-01-06 14:37 UTC (permalink / raw)
  To: linux-fsdevel, open-osd, Stable Tree; +Cc: Randy Dunlap

October's large testing has unearthed some nasty bugs. Do to
my Deficiency I have failed to send them for the 3.2 Kernel,

so here they are for the 3.3 Merge window, and also for
the Stable@ tree.

Sorry!

[PATCH 1/4] ore: FIX breakage when MISC_FILESYSTEMS is not set
	Random Kconfig breakage found by Randy Dunlap. Thanks Randy.

[PATCH 2/4] ore: Fix crash in case of an IO error.
[PATCH 3/4] ore: fix BUG_ON, too few sgs when reading
[PATCH 4/4] ore: Must support none-PAGE-aligned IO
	All these are BUG_ON(s) and crashes. Most interesting
	is the last one which proves that the 3.2 RAID engine
	was a bit premature.

After this code both exofs and pNFS pass tests including RAID5
verification. Cheers! (exofs did so before pNFS didn't)

Thanks
Boaz

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/4] ore: FIX breakage when MISC_FILESYSTEMS is not set
  2012-01-06 14:37 [PATCHSET 0/4] ore: Kernel 3.3 BUG squashing (Also for 3.2 Stable@) Boaz Harrosh
@ 2012-01-06 14:40 ` Boaz Harrosh
  2012-01-07 18:19   ` Randy Dunlap
  2012-01-06 14:42 ` [PATCH 2/4] ore: Fix crash in case of an IO error Boaz Harrosh
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 8+ messages in thread
From: Boaz Harrosh @ 2012-01-06 14:40 UTC (permalink / raw)
  To: linux-fsdevel, open-osd, Stable Tree; +Cc: Randy Dunlap


As Reported by Randy Dunlap

When MISC_FILESYSTEMS is not enabled and NFS4.1 is:

fs/built-in.o: In function `objio_alloc_io_state':
objio_osd.c:(.text+0xcb525): undefined reference to `ore_get_rw_state'
fs/built-in.o: In function `_write_done':
objio_osd.c:(.text+0xcb58d): undefined reference to `ore_check_io'
fs/built-in.o: In function `_read_done':
...

When MISC_FILESYSTEMS, which is more of a GUI thing then anything else,
is not selected. exofs/Kconfig is never examined during Kconfig,
and it can not do it's magic stuff to automatically select everything
needed.

We must split exofs/Kconfig in two. The ore one is always included.
And the exofs one is left in it's old place in the menu.

[Needed for the 3.2.0 Kernel]
CC: Stable Tree <stable@kernel.org>
Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
---
 fs/Kconfig           |    2 ++
 fs/exofs/Kconfig     |   11 -----------
 fs/exofs/Kconfig.ore |   12 ++++++++++++
 3 files changed, 14 insertions(+), 11 deletions(-)
 create mode 100644 fs/exofs/Kconfig.ore

diff --git a/fs/Kconfig b/fs/Kconfig
index 5f4c45d..6ad58a5 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -218,6 +218,8 @@ source "fs/exofs/Kconfig"
 
 endif # MISC_FILESYSTEMS
 
+source "fs/exofs/Kconfig.ore"
+
 menuconfig NETWORK_FILESYSTEMS
 	bool "Network File Systems"
 	default y
diff --git a/fs/exofs/Kconfig b/fs/exofs/Kconfig
index da42f32..86194b2 100644
--- a/fs/exofs/Kconfig
+++ b/fs/exofs/Kconfig
@@ -1,14 +1,3 @@
-# Note ORE needs to "select ASYNC_XOR". So Not to force multiple selects
-# for every ORE user we do it like this. Any user should add itself here
-# at the "depends on EXOFS_FS || ..." with an ||. The dependencies are
-# selected here, and we default to "ON". So in effect it is like been
-# selected by any of the users.
-config ORE
-	tristate
-	depends on EXOFS_FS || PNFS_OBJLAYOUT
-	select ASYNC_XOR
-	default SCSI_OSD_ULD
-
 config EXOFS_FS
 	tristate "exofs: OSD based file system support"
 	depends on SCSI_OSD_ULD
diff --git a/fs/exofs/Kconfig.ore b/fs/exofs/Kconfig.ore
new file mode 100644
index 0000000..1ca7fb7
--- /dev/null
+++ b/fs/exofs/Kconfig.ore
@@ -0,0 +1,12 @@
+# ORE - Objects Raid Engine (libore.ko)
+#
+# Note ORE needs to "select ASYNC_XOR". So Not to force multiple selects
+# for every ORE user we do it like this. Any user should add itself here
+# at the "depends on EXOFS_FS || ..." with an ||. The dependencies are
+# selected here, and we default to "ON". So in effect it is like been
+# selected by any of the users.
+config ORE
+	tristate
+	depends on EXOFS_FS || PNFS_OBJLAYOUT
+	select ASYNC_XOR
+	default SCSI_OSD_ULD
-- 
1.7.2.3



^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 2/4] ore: Fix crash in case of an IO error.
  2012-01-06 14:37 [PATCHSET 0/4] ore: Kernel 3.3 BUG squashing (Also for 3.2 Stable@) Boaz Harrosh
  2012-01-06 14:40 ` [PATCH 1/4] ore: FIX breakage when MISC_FILESYSTEMS is not set Boaz Harrosh
@ 2012-01-06 14:42 ` Boaz Harrosh
  2012-01-06 14:43 ` [PATCH 3/4] ore: fix BUG_ON, too few sgs when reading Boaz Harrosh
  2012-01-06 14:46 ` [PATCH 4/4] ore: Must support none-PAGE-aligned IO Boaz Harrosh
  3 siblings, 0 replies; 8+ messages in thread
From: Boaz Harrosh @ 2012-01-06 14:42 UTC (permalink / raw)
  To: linux-fsdevel, open-osd, Stable Tree


The users of ore_check_io() expect the reported device
(In case of error) to be indexed relative to the passed-in
ore_components table, and not the logical dev index.

This causes a crash inside objlayoutdriver in case of
an IO error.

[Bug in 3.2.0 Kernel]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
---
 fs/exofs/ore.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/exofs/ore.c b/fs/exofs/ore.c
index d271ad8..894f3e1 100644
--- a/fs/exofs/ore.c
+++ b/fs/exofs/ore.c
@@ -445,10 +445,10 @@ int ore_check_io(struct ore_io_state *ios, ore_on_dev_error on_dev_error)
 			u64 residual = ios->reading ?
 					or->in.residual : or->out.residual;
 			u64 offset = (ios->offset + ios->length) - residual;
-			struct ore_dev *od = ios->oc->ods[
-					per_dev->dev - ios->oc->first_dev];
+			unsigned dev = per_dev->dev - ios->oc->first_dev;
+			struct ore_dev *od = ios->oc->ods[dev];
 
-			on_dev_error(ios, od, per_dev->dev, osi.osd_err_pri,
+			on_dev_error(ios, od, dev, osi.osd_err_pri,
 				     offset, residual);
 		}
 		if (osi.osd_err_pri >= acumulated_osd_err) {
-- 
1.7.2.3



^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 3/4] ore: fix BUG_ON, too few sgs when reading
  2012-01-06 14:37 [PATCHSET 0/4] ore: Kernel 3.3 BUG squashing (Also for 3.2 Stable@) Boaz Harrosh
  2012-01-06 14:40 ` [PATCH 1/4] ore: FIX breakage when MISC_FILESYSTEMS is not set Boaz Harrosh
  2012-01-06 14:42 ` [PATCH 2/4] ore: Fix crash in case of an IO error Boaz Harrosh
@ 2012-01-06 14:43 ` Boaz Harrosh
  2012-01-06 14:46 ` [PATCH 4/4] ore: Must support none-PAGE-aligned IO Boaz Harrosh
  3 siblings, 0 replies; 8+ messages in thread
From: Boaz Harrosh @ 2012-01-06 14:43 UTC (permalink / raw)
  To: linux-fsdevel, open-osd, Stable Tree


When reading RAID5 files, in rare cases, we calculated too
few sg segments. There should be two extra for the beginning
and end partial units.

Also "too few sg segments" should not be a BUG_ON there is
all the mechanics in place to handle it, as a short read.
So just return -ENOMEM and the rest of the code will gracefully
split the IO.

[Bug in 3.2.0 Kernel]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
---
 fs/exofs/ore.c      |    2 +-
 fs/exofs/ore_raid.c |    6 +++++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/exofs/ore.c b/fs/exofs/ore.c
index 894f3e1..49cf230 100644
--- a/fs/exofs/ore.c
+++ b/fs/exofs/ore.c
@@ -266,7 +266,7 @@ int  ore_get_rw_state(struct ore_layout *layout, struct ore_components *oc,
 
 			/* first/last seg is split */
 			num_raid_units += layout->group_width;
-			sgs_per_dev = div_u64(num_raid_units, data_devs);
+			sgs_per_dev = div_u64(num_raid_units, data_devs) + 2;
 		} else {
 			/* For Writes add parity pages array. */
 			max_par_pages = num_raid_units * pages_in_unit *
diff --git a/fs/exofs/ore_raid.c b/fs/exofs/ore_raid.c
index 29c47e5..414a2df 100644
--- a/fs/exofs/ore_raid.c
+++ b/fs/exofs/ore_raid.c
@@ -551,7 +551,11 @@ int _ore_add_parity_unit(struct ore_io_state *ios,
 			    unsigned cur_len)
 {
 	if (ios->reading) {
-		BUG_ON(per_dev->cur_sg >= ios->sgs_per_dev);
+		if (per_dev->cur_sg >= ios->sgs_per_dev) {
+			ORE_DBGMSG("cur_sg(%d) >= sgs_per_dev(%d)\n" ,
+				per_dev->cur_sg, ios->sgs_per_dev);
+			return -ENOMEM;
+		}
 		_ore_add_sg_seg(per_dev, cur_len, true);
 	} else {
 		struct __stripe_pages_2d *sp2d = ios->sp2d;
-- 
1.7.2.3


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 4/4] ore: Must support none-PAGE-aligned IO
  2012-01-06 14:37 [PATCHSET 0/4] ore: Kernel 3.3 BUG squashing (Also for 3.2 Stable@) Boaz Harrosh
                   ` (2 preceding siblings ...)
  2012-01-06 14:43 ` [PATCH 3/4] ore: fix BUG_ON, too few sgs when reading Boaz Harrosh
@ 2012-01-06 14:46 ` Boaz Harrosh
  2012-01-08  8:50   ` [PATCH 4/4 ver2] " Boaz Harrosh
  2012-01-08  8:50   ` [PATCH 4/4] " Boaz Harrosh
  3 siblings, 2 replies; 8+ messages in thread
From: Boaz Harrosh @ 2012-01-06 14:46 UTC (permalink / raw)
  To: linux-fsdevel, open-osd, Stable Tree


NFS might send us offsets that are not PAGE aligned. So
we must read in the reminder of the first/last pages, in cases
we need it for Parity calculations.

We only add an sg segments to read the partial page. But
we don't mark it as read=true because it is a lock-for-write
page.

TODO: In some cases (IO spans a single unit) we can just
adjust the raid_unit offset/length, but this is left for
later Kernels.

[Bug in 3.2.0 Kernel]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
---
 fs/exofs/ore_raid.c |   71 ++++++++++++++++++++++++++++++++++++++++++--------
 1 files changed, 59 insertions(+), 12 deletions(-)

diff --git a/fs/exofs/ore_raid.c b/fs/exofs/ore_raid.c
index 414a2df..b3047ef 100644
--- a/fs/exofs/ore_raid.c
+++ b/fs/exofs/ore_raid.c
@@ -328,8 +328,8 @@ static int _alloc_read_4_write(struct ore_io_state *ios)
 /* @si contains info of the to-be-inserted page. Update of @si should be
  * maintained by caller. Specificaly si->dev, si->obj_offset, ...
  */
-static int _add_to_read_4_write(struct ore_io_state *ios,
-				struct ore_striping_info *si, struct page *page)
+static int _add_to_r4w(struct ore_io_state *ios, struct ore_striping_info *si,
+		       struct page *page, unsigned pg_len)
 {
 	struct request_queue *q;
 	struct ore_per_dev_state *per_dev;
@@ -366,17 +366,59 @@ static int _add_to_read_4_write(struct ore_io_state *ios,
 		_ore_add_sg_seg(per_dev, gap, true);
 	}
 	q = osd_request_queue(ore_comp_dev(read_ios->oc, per_dev->dev));
-	added_len = bio_add_pc_page(q, per_dev->bio, page, PAGE_SIZE, 0);
-	if (unlikely(added_len != PAGE_SIZE)) {
+	added_len = bio_add_pc_page(q, per_dev->bio, page, pg_len, 0);
+	if (unlikely(added_len != pg_len)) {
 		ORE_DBGMSG("Failed to bio_add_pc_page bi_vcnt=%d\n",
 			      per_dev->bio->bi_vcnt);
 		return -ENOMEM;
 	}
 
-	per_dev->length += PAGE_SIZE;
+	per_dev->length += pg_len;
 	return 0;
 }
 
+/* read the beginning of an unaligned first page */
+static int _add_to_r4w_first_page(struct ore_io_state *ios, struct page *page)
+{
+	struct ore_striping_info si;
+	unsigned pg_len;
+
+	ore_calc_stripe_info(ios->layout, ios->offset, 0, &si);
+
+	pg_len = si.obj_offset % PAGE_SIZE;
+	si.obj_offset -= pg_len;
+
+	ORE_DBGMSG("offset=0x%llx len=0x%x index=0x%lx dev=%x\n",
+		   _LLU(si.obj_offset), pg_len, page->index, si.dev);
+
+	return _add_to_r4w(ios, &si, page, pg_len);
+}
+
+/* read the end of an incomplete last page */
+static int _add_to_r4w_last_page(struct ore_io_state *ios, u64 *offset)
+{
+	struct ore_striping_info si;
+	struct page *page;
+	unsigned pg_len, p, c;
+
+	ore_calc_stripe_info(ios->layout, *offset, 0, &si);
+
+	p = si.unit_off / PAGE_SIZE;
+	c = _dev_order(ios->layout->group_width * ios->layout->mirrors_p1,
+		       ios->layout->mirrors_p1, si.par_dev, si.dev);
+	page = ios->sp2d->_1p_stripes[p].pages[c];
+
+	pg_len = PAGE_SIZE - (si.unit_off % PAGE_SIZE);
+	*offset += pg_len;
+
+	ORE_DBGMSG("p=%d, c=%d next-offset=0x%llx len=0x%x dev=%x par_dev=%d\n",
+		   p, c, _LLU(*offset), pg_len, si.dev, si.par_dev);
+
+	BUG_ON(!page);
+
+	return _add_to_r4w(ios, &si, page, pg_len);
+}
+
 static void _mark_read4write_pages_uptodate(struct ore_io_state *ios, int ret)
 {
 	struct bio_vec *bv;
@@ -444,9 +486,13 @@ static int _read_4_write(struct ore_io_state *ios)
 			struct page **pp = &_1ps->pages[c];
 			bool uptodate;
 
-			if (*pp)
+			if (*pp) {
+				if (ios->offset % PAGE_SIZE)
+					/* Read the remainder of the page */
+					_add_to_r4w_first_page(ios, *pp);
 				/* to-be-written pages start here */
 				goto read_last_stripe;
+			}
 
 			*pp = ios->r4w->get_page(ios->private, offset,
 						 &uptodate);
@@ -454,7 +500,7 @@ static int _read_4_write(struct ore_io_state *ios)
 				return -ENOMEM;
 
 			if (!uptodate)
-				_add_to_read_4_write(ios, &read_si, *pp);
+				_add_to_r4w(ios, &read_si, *pp, PAGE_SIZE);
 
 			/* Mark read-pages to be cache_released */
 			_1ps->page_is_read[c] = true;
@@ -465,8 +511,11 @@ static int _read_4_write(struct ore_io_state *ios)
 	}
 
 read_last_stripe:
-	offset = ios->offset + (ios->length + PAGE_SIZE - 1) /
-				PAGE_SIZE * PAGE_SIZE;
+	offset = ios->offset + ios->length;
+	if (offset % PAGE_SIZE)
+		_add_to_r4w_last_page(ios, &offset);
+		/* offset will be aligned to next page */
+
 	last_stripe_end = div_u64(offset + bytes_in_stripe - 1, bytes_in_stripe)
 				 * bytes_in_stripe;
 	if (offset == last_stripe_end) /* Optimize for the aligned case */
@@ -503,7 +552,7 @@ read_last_stripe:
 			/* Mark read-pages to be cache_released */
 			_1ps->page_is_read[c] = true;
 			if (!uptodate)
-				_add_to_read_4_write(ios, &read_si, page);
+				_add_to_r4w(ios, &read_si, page, PAGE_SIZE);
 		}
 
 		offset += PAGE_SIZE;
@@ -616,8 +665,6 @@ int _ore_post_alloc_raid_stuff(struct ore_io_state *ios)
 			return -ENOMEM;
 		}
 
-		BUG_ON(ios->offset % PAGE_SIZE);
-
 		/* Round io down to last full strip */
 		first_stripe = div_u64(ios->offset, stripe_size);
 		last_stripe = div_u64(ios->offset + ios->length, stripe_size);
-- 
1.7.2.3


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/4] ore: FIX breakage when MISC_FILESYSTEMS is not set
  2012-01-06 14:40 ` [PATCH 1/4] ore: FIX breakage when MISC_FILESYSTEMS is not set Boaz Harrosh
@ 2012-01-07 18:19   ` Randy Dunlap
  0 siblings, 0 replies; 8+ messages in thread
From: Randy Dunlap @ 2012-01-07 18:19 UTC (permalink / raw)
  To: Boaz Harrosh; +Cc: linux-fsdevel, open-osd, Stable Tree

On 01/06/2012 06:40 AM, Boaz Harrosh wrote:
> 
> As Reported by Randy Dunlap
> 
> When MISC_FILESYSTEMS is not enabled and NFS4.1 is:
> 
> fs/built-in.o: In function `objio_alloc_io_state':
> objio_osd.c:(.text+0xcb525): undefined reference to `ore_get_rw_state'
> fs/built-in.o: In function `_write_done':
> objio_osd.c:(.text+0xcb58d): undefined reference to `ore_check_io'
> fs/built-in.o: In function `_read_done':
> ...
> 
> When MISC_FILESYSTEMS, which is more of a GUI thing then anything else,
> is not selected. exofs/Kconfig is never examined during Kconfig,
> and it can not do it's magic stuff to automatically select everything
> needed.
> 
> We must split exofs/Kconfig in two. The ore one is always included.
> And the exofs one is left in it's old place in the menu.
> 
> [Needed for the 3.2.0 Kernel]
> CC: Stable Tree <stable@kernel.org>
> Reported-by: Randy Dunlap <rdunlap@xenotime.net>
> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

Acked-by: Randy Dunlap <rdunlap@xenotime.net>

Thanks.

> ---
>  fs/Kconfig           |    2 ++
>  fs/exofs/Kconfig     |   11 -----------
>  fs/exofs/Kconfig.ore |   12 ++++++++++++
>  3 files changed, 14 insertions(+), 11 deletions(-)
>  create mode 100644 fs/exofs/Kconfig.ore
> 
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 5f4c45d..6ad58a5 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -218,6 +218,8 @@ source "fs/exofs/Kconfig"
>  
>  endif # MISC_FILESYSTEMS
>  
> +source "fs/exofs/Kconfig.ore"
> +
>  menuconfig NETWORK_FILESYSTEMS
>  	bool "Network File Systems"
>  	default y
> diff --git a/fs/exofs/Kconfig b/fs/exofs/Kconfig
> index da42f32..86194b2 100644
> --- a/fs/exofs/Kconfig
> +++ b/fs/exofs/Kconfig
> @@ -1,14 +1,3 @@
> -# Note ORE needs to "select ASYNC_XOR". So Not to force multiple selects
> -# for every ORE user we do it like this. Any user should add itself here
> -# at the "depends on EXOFS_FS || ..." with an ||. The dependencies are
> -# selected here, and we default to "ON". So in effect it is like been
> -# selected by any of the users.
> -config ORE
> -	tristate
> -	depends on EXOFS_FS || PNFS_OBJLAYOUT
> -	select ASYNC_XOR
> -	default SCSI_OSD_ULD
> -
>  config EXOFS_FS
>  	tristate "exofs: OSD based file system support"
>  	depends on SCSI_OSD_ULD
> diff --git a/fs/exofs/Kconfig.ore b/fs/exofs/Kconfig.ore
> new file mode 100644
> index 0000000..1ca7fb7
> --- /dev/null
> +++ b/fs/exofs/Kconfig.ore
> @@ -0,0 +1,12 @@
> +# ORE - Objects Raid Engine (libore.ko)
> +#
> +# Note ORE needs to "select ASYNC_XOR". So Not to force multiple selects
> +# for every ORE user we do it like this. Any user should add itself here
> +# at the "depends on EXOFS_FS || ..." with an ||. The dependencies are
> +# selected here, and we default to "ON". So in effect it is like been
> +# selected by any of the users.
> +config ORE
> +	tristate
> +	depends on EXOFS_FS || PNFS_OBJLAYOUT
> +	select ASYNC_XOR
> +	default SCSI_OSD_ULD


-- 
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 4/4 ver2] ore: Must support none-PAGE-aligned IO
  2012-01-06 14:46 ` [PATCH 4/4] ore: Must support none-PAGE-aligned IO Boaz Harrosh
@ 2012-01-08  8:50   ` Boaz Harrosh
  2012-01-08  8:50   ` [PATCH 4/4] " Boaz Harrosh
  1 sibling, 0 replies; 8+ messages in thread
From: Boaz Harrosh @ 2012-01-08  8:50 UTC (permalink / raw)
  To: linux-fsdevel, open-osd, Stable Tree


NFS might send us offsets that are not PAGE aligned. So
we must read in the reminder of the first/last pages, in cases
we need it for Parity calculations.

We only add an sg segments to read the partial page. But
we don't mark it as read=true because it is a lock-for-write
page.

TODO: In some cases (IO spans a single unit) we can just
adjust the raid_unit offset/length, but this is left for
later Kernels.

[Bug in 3.2.0 Kernel]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
---
 fs/exofs/ore_raid.c |   72 ++++++++++++++++++++++++++++++++++++++++++--------
 1 files changed, 60 insertions(+), 12 deletions(-)

diff --git a/fs/exofs/ore_raid.c b/fs/exofs/ore_raid.c
index 414a2df..d222c77 100644
--- a/fs/exofs/ore_raid.c
+++ b/fs/exofs/ore_raid.c
@@ -328,8 +328,8 @@ static int _alloc_read_4_write(struct ore_io_state *ios)
 /* @si contains info of the to-be-inserted page. Update of @si should be
  * maintained by caller. Specificaly si->dev, si->obj_offset, ...
  */
-static int _add_to_read_4_write(struct ore_io_state *ios,
-				struct ore_striping_info *si, struct page *page)
+static int _add_to_r4w(struct ore_io_state *ios, struct ore_striping_info *si,
+		       struct page *page, unsigned pg_len)
 {
 	struct request_queue *q;
 	struct ore_per_dev_state *per_dev;
@@ -366,17 +366,60 @@ static int _add_to_read_4_write(struct ore_io_state *ios,
 		_ore_add_sg_seg(per_dev, gap, true);
 	}
 	q = osd_request_queue(ore_comp_dev(read_ios->oc, per_dev->dev));
-	added_len = bio_add_pc_page(q, per_dev->bio, page, PAGE_SIZE, 0);
-	if (unlikely(added_len != PAGE_SIZE)) {
+	added_len = bio_add_pc_page(q, per_dev->bio, page, pg_len,
+				    si->obj_offset % PAGE_SIZE);
+	if (unlikely(added_len != pg_len)) {
 		ORE_DBGMSG("Failed to bio_add_pc_page bi_vcnt=%d\n",
 			      per_dev->bio->bi_vcnt);
 		return -ENOMEM;
 	}
 
-	per_dev->length += PAGE_SIZE;
+	per_dev->length += pg_len;
 	return 0;
 }
 
+/* read the beginning of an unaligned first page */
+static int _add_to_r4w_first_page(struct ore_io_state *ios, struct page *page)
+{
+	struct ore_striping_info si;
+	unsigned pg_len;
+
+	ore_calc_stripe_info(ios->layout, ios->offset, 0, &si);
+
+	pg_len = si.obj_offset % PAGE_SIZE;
+	si.obj_offset -= pg_len;
+
+	ORE_DBGMSG("offset=0x%llx len=0x%x index=0x%lx dev=%x\n",
+		   _LLU(si.obj_offset), pg_len, page->index, si.dev);
+
+	return _add_to_r4w(ios, &si, page, pg_len);
+}
+
+/* read the end of an incomplete last page */
+static int _add_to_r4w_last_page(struct ore_io_state *ios, u64 *offset)
+{
+	struct ore_striping_info si;
+	struct page *page;
+	unsigned pg_len, p, c;
+
+	ore_calc_stripe_info(ios->layout, *offset, 0, &si);
+
+	p = si.unit_off / PAGE_SIZE;
+	c = _dev_order(ios->layout->group_width * ios->layout->mirrors_p1,
+		       ios->layout->mirrors_p1, si.par_dev, si.dev);
+	page = ios->sp2d->_1p_stripes[p].pages[c];
+
+	pg_len = PAGE_SIZE - (si.unit_off % PAGE_SIZE);
+	*offset += pg_len;
+
+	ORE_DBGMSG("p=%d, c=%d next-offset=0x%llx len=0x%x dev=%x par_dev=%d\n",
+		   p, c, _LLU(*offset), pg_len, si.dev, si.par_dev);
+
+	BUG_ON(!page);
+
+	return _add_to_r4w(ios, &si, page, pg_len);
+}
+
 static void _mark_read4write_pages_uptodate(struct ore_io_state *ios, int ret)
 {
 	struct bio_vec *bv;
@@ -444,9 +487,13 @@ static int _read_4_write(struct ore_io_state *ios)
 			struct page **pp = &_1ps->pages[c];
 			bool uptodate;
 
-			if (*pp)
+			if (*pp) {
+				if (ios->offset % PAGE_SIZE)
+					/* Read the remainder of the page */
+					_add_to_r4w_first_page(ios, *pp);
 				/* to-be-written pages start here */
 				goto read_last_stripe;
+			}
 
 			*pp = ios->r4w->get_page(ios->private, offset,
 						 &uptodate);
@@ -454,7 +501,7 @@ static int _read_4_write(struct ore_io_state *ios)
 				return -ENOMEM;
 
 			if (!uptodate)
-				_add_to_read_4_write(ios, &read_si, *pp);
+				_add_to_r4w(ios, &read_si, *pp, PAGE_SIZE);
 
 			/* Mark read-pages to be cache_released */
 			_1ps->page_is_read[c] = true;
@@ -465,8 +512,11 @@ static int _read_4_write(struct ore_io_state *ios)
 	}
 
 read_last_stripe:
-	offset = ios->offset + (ios->length + PAGE_SIZE - 1) /
-				PAGE_SIZE * PAGE_SIZE;
+	offset = ios->offset + ios->length;
+	if (offset % PAGE_SIZE)
+		_add_to_r4w_last_page(ios, &offset);
+		/* offset will be aligned to next page */
+
 	last_stripe_end = div_u64(offset + bytes_in_stripe - 1, bytes_in_stripe)
 				 * bytes_in_stripe;
 	if (offset == last_stripe_end) /* Optimize for the aligned case */
@@ -503,7 +553,7 @@ read_last_stripe:
 			/* Mark read-pages to be cache_released */
 			_1ps->page_is_read[c] = true;
 			if (!uptodate)
-				_add_to_read_4_write(ios, &read_si, page);
+				_add_to_r4w(ios, &read_si, page, PAGE_SIZE);
 		}
 
 		offset += PAGE_SIZE;
@@ -616,8 +666,6 @@ int _ore_post_alloc_raid_stuff(struct ore_io_state *ios)
 			return -ENOMEM;
 		}
 
-		BUG_ON(ios->offset % PAGE_SIZE);
-
 		/* Round io down to last full strip */
 		first_stripe = div_u64(ios->offset, stripe_size);
 		last_stripe = div_u64(ios->offset + ios->length, stripe_size);
-- 
1.7.2.3



^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 4/4] ore: Must support none-PAGE-aligned IO
  2012-01-06 14:46 ` [PATCH 4/4] ore: Must support none-PAGE-aligned IO Boaz Harrosh
  2012-01-08  8:50   ` [PATCH 4/4 ver2] " Boaz Harrosh
@ 2012-01-08  8:50   ` Boaz Harrosh
  1 sibling, 0 replies; 8+ messages in thread
From: Boaz Harrosh @ 2012-01-08  8:50 UTC (permalink / raw)
  To: linux-fsdevel, open-osd, Stable Tree

On 01/06/2012 04:46 PM, Boaz Harrosh wrote:
> 
> NFS might send us offsets that are not PAGE aligned. So
> we must read in the reminder of the first/last pages, in cases
> we need it for Parity calculations.
> 
> We only add an sg segments to read the partial page. But
> we don't mark it as read=true because it is a lock-for-write
> page.
> 
> TODO: In some cases (IO spans a single unit) we can just
> adjust the raid_unit offset/length, but this is left for
> later Kernels.
> 
> [Bug in 3.2.0 Kernel]
> CC: Stable Tree <stable@kernel.org>
> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

This patch had a data corruption bug. I'll post a version 2

Here is the diff of ver2 from ver1
---
diff --git a/fs/exofs/ore_raid.c b/fs/exofs/ore_raid.c
index b3047ef..d222c77 100644
--- a/fs/exofs/ore_raid.c
+++ b/fs/exofs/ore_raid.c
@@ -366,7 +366,8 @@ static int _add_to_r4w(struct ore_io_state *ios, struct ore_striping_info *si,
 		_ore_add_sg_seg(per_dev, gap, true);
 	}
 	q = osd_request_queue(ore_comp_dev(read_ios->oc, per_dev->dev));
-	added_len = bio_add_pc_page(q, per_dev->bio, page, pg_len, 0);
+	added_len = bio_add_pc_page(q, per_dev->bio, page, pg_len,
+				    si->obj_offset % PAGE_SIZE);
 	if (unlikely(added_len != pg_len)) {
 		ORE_DBGMSG("Failed to bio_add_pc_page bi_vcnt=%d\n",
 			      per_dev->bio->bi_vcnt);


> ---
>  fs/exofs/ore_raid.c |   71 ++++++++++++++++++++++++++++++++++++++++++--------
>  1 files changed, 59 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/exofs/ore_raid.c b/fs/exofs/ore_raid.c
> index 414a2df..b3047ef 100644
> --- a/fs/exofs/ore_raid.c
> +++ b/fs/exofs/ore_raid.c
> @@ -328,8 +328,8 @@ static int _alloc_read_4_write(struct ore_io_state *ios)
>  /* @si contains info of the to-be-inserted page. Update of @si should be
>   * maintained by caller. Specificaly si->dev, si->obj_offset, ...
>   */
> -static int _add_to_read_4_write(struct ore_io_state *ios,
> -				struct ore_striping_info *si, struct page *page)
> +static int _add_to_r4w(struct ore_io_state *ios, struct ore_striping_info *si,
> +		       struct page *page, unsigned pg_len)
>  {
>  	struct request_queue *q;
>  	struct ore_per_dev_state *per_dev;
> @@ -366,17 +366,59 @@ static int _add_to_read_4_write(struct ore_io_state *ios,
>  		_ore_add_sg_seg(per_dev, gap, true);
>  	}
>  	q = osd_request_queue(ore_comp_dev(read_ios->oc, per_dev->dev));
> -	added_len = bio_add_pc_page(q, per_dev->bio, page, PAGE_SIZE, 0);
> -	if (unlikely(added_len != PAGE_SIZE)) {
> +	added_len = bio_add_pc_page(q, per_dev->bio, page, pg_len, 0);
> +	if (unlikely(added_len != pg_len)) {
>  		ORE_DBGMSG("Failed to bio_add_pc_page bi_vcnt=%d\n",
>  			      per_dev->bio->bi_vcnt);
>  		return -ENOMEM;
>  	}
>  
> -	per_dev->length += PAGE_SIZE;
> +	per_dev->length += pg_len;
>  	return 0;
>  }
>  
> +/* read the beginning of an unaligned first page */
> +static int _add_to_r4w_first_page(struct ore_io_state *ios, struct page *page)
> +{
> +	struct ore_striping_info si;
> +	unsigned pg_len;
> +
> +	ore_calc_stripe_info(ios->layout, ios->offset, 0, &si);
> +
> +	pg_len = si.obj_offset % PAGE_SIZE;
> +	si.obj_offset -= pg_len;
> +
> +	ORE_DBGMSG("offset=0x%llx len=0x%x index=0x%lx dev=%x\n",
> +		   _LLU(si.obj_offset), pg_len, page->index, si.dev);
> +
> +	return _add_to_r4w(ios, &si, page, pg_len);
> +}
> +
> +/* read the end of an incomplete last page */
> +static int _add_to_r4w_last_page(struct ore_io_state *ios, u64 *offset)
> +{
> +	struct ore_striping_info si;
> +	struct page *page;
> +	unsigned pg_len, p, c;
> +
> +	ore_calc_stripe_info(ios->layout, *offset, 0, &si);
> +
> +	p = si.unit_off / PAGE_SIZE;
> +	c = _dev_order(ios->layout->group_width * ios->layout->mirrors_p1,
> +		       ios->layout->mirrors_p1, si.par_dev, si.dev);
> +	page = ios->sp2d->_1p_stripes[p].pages[c];
> +
> +	pg_len = PAGE_SIZE - (si.unit_off % PAGE_SIZE);
> +	*offset += pg_len;
> +
> +	ORE_DBGMSG("p=%d, c=%d next-offset=0x%llx len=0x%x dev=%x par_dev=%d\n",
> +		   p, c, _LLU(*offset), pg_len, si.dev, si.par_dev);
> +
> +	BUG_ON(!page);
> +
> +	return _add_to_r4w(ios, &si, page, pg_len);
> +}
> +
>  static void _mark_read4write_pages_uptodate(struct ore_io_state *ios, int ret)
>  {
>  	struct bio_vec *bv;
> @@ -444,9 +486,13 @@ static int _read_4_write(struct ore_io_state *ios)
>  			struct page **pp = &_1ps->pages[c];
>  			bool uptodate;
>  
> -			if (*pp)
> +			if (*pp) {
> +				if (ios->offset % PAGE_SIZE)
> +					/* Read the remainder of the page */
> +					_add_to_r4w_first_page(ios, *pp);
>  				/* to-be-written pages start here */
>  				goto read_last_stripe;
> +			}
>  
>  			*pp = ios->r4w->get_page(ios->private, offset,
>  						 &uptodate);
> @@ -454,7 +500,7 @@ static int _read_4_write(struct ore_io_state *ios)
>  				return -ENOMEM;
>  
>  			if (!uptodate)
> -				_add_to_read_4_write(ios, &read_si, *pp);
> +				_add_to_r4w(ios, &read_si, *pp, PAGE_SIZE);
>  
>  			/* Mark read-pages to be cache_released */
>  			_1ps->page_is_read[c] = true;
> @@ -465,8 +511,11 @@ static int _read_4_write(struct ore_io_state *ios)
>  	}
>  
>  read_last_stripe:
> -	offset = ios->offset + (ios->length + PAGE_SIZE - 1) /
> -				PAGE_SIZE * PAGE_SIZE;
> +	offset = ios->offset + ios->length;
> +	if (offset % PAGE_SIZE)
> +		_add_to_r4w_last_page(ios, &offset);
> +		/* offset will be aligned to next page */
> +
>  	last_stripe_end = div_u64(offset + bytes_in_stripe - 1, bytes_in_stripe)
>  				 * bytes_in_stripe;
>  	if (offset == last_stripe_end) /* Optimize for the aligned case */
> @@ -503,7 +552,7 @@ read_last_stripe:
>  			/* Mark read-pages to be cache_released */
>  			_1ps->page_is_read[c] = true;
>  			if (!uptodate)
> -				_add_to_read_4_write(ios, &read_si, page);
> +				_add_to_r4w(ios, &read_si, page, PAGE_SIZE);
>  		}
>  
>  		offset += PAGE_SIZE;
> @@ -616,8 +665,6 @@ int _ore_post_alloc_raid_stuff(struct ore_io_state *ios)
>  			return -ENOMEM;
>  		}
>  
> -		BUG_ON(ios->offset % PAGE_SIZE);
> -
>  		/* Round io down to last full strip */
>  		first_stripe = div_u64(ios->offset, stripe_size);
>  		last_stripe = div_u64(ios->offset + ios->length, stripe_size);


^ permalink raw reply related	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2012-01-08  8:50 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-01-06 14:37 [PATCHSET 0/4] ore: Kernel 3.3 BUG squashing (Also for 3.2 Stable@) Boaz Harrosh
2012-01-06 14:40 ` [PATCH 1/4] ore: FIX breakage when MISC_FILESYSTEMS is not set Boaz Harrosh
2012-01-07 18:19   ` Randy Dunlap
2012-01-06 14:42 ` [PATCH 2/4] ore: Fix crash in case of an IO error Boaz Harrosh
2012-01-06 14:43 ` [PATCH 3/4] ore: fix BUG_ON, too few sgs when reading Boaz Harrosh
2012-01-06 14:46 ` [PATCH 4/4] ore: Must support none-PAGE-aligned IO Boaz Harrosh
2012-01-08  8:50   ` [PATCH 4/4 ver2] " Boaz Harrosh
2012-01-08  8:50   ` [PATCH 4/4] " Boaz Harrosh

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).