* [PATCHSET 0/4] ore: Kernel 3.3 BUG squashing (Also for 3.2 Stable@)
@ 2012-01-06 14:37 Boaz Harrosh
2012-01-06 14:40 ` [PATCH 1/4] ore: FIX breakage when MISC_FILESYSTEMS is not set Boaz Harrosh
` (3 more replies)
0 siblings, 4 replies; 8+ messages in thread
From: Boaz Harrosh @ 2012-01-06 14:37 UTC (permalink / raw)
To: linux-fsdevel, open-osd, Stable Tree; +Cc: Randy Dunlap
October's large testing has unearthed some nasty bugs. Do to
my Deficiency I have failed to send them for the 3.2 Kernel,
so here they are for the 3.3 Merge window, and also for
the Stable@ tree.
Sorry!
[PATCH 1/4] ore: FIX breakage when MISC_FILESYSTEMS is not set
Random Kconfig breakage found by Randy Dunlap. Thanks Randy.
[PATCH 2/4] ore: Fix crash in case of an IO error.
[PATCH 3/4] ore: fix BUG_ON, too few sgs when reading
[PATCH 4/4] ore: Must support none-PAGE-aligned IO
All these are BUG_ON(s) and crashes. Most interesting
is the last one which proves that the 3.2 RAID engine
was a bit premature.
After this code both exofs and pNFS pass tests including RAID5
verification. Cheers! (exofs did so before pNFS didn't)
Thanks
Boaz
^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH 1/4] ore: FIX breakage when MISC_FILESYSTEMS is not set
2012-01-06 14:37 [PATCHSET 0/4] ore: Kernel 3.3 BUG squashing (Also for 3.2 Stable@) Boaz Harrosh
@ 2012-01-06 14:40 ` Boaz Harrosh
2012-01-07 18:19 ` Randy Dunlap
2012-01-06 14:42 ` [PATCH 2/4] ore: Fix crash in case of an IO error Boaz Harrosh
` (2 subsequent siblings)
3 siblings, 1 reply; 8+ messages in thread
From: Boaz Harrosh @ 2012-01-06 14:40 UTC (permalink / raw)
To: linux-fsdevel, open-osd, Stable Tree; +Cc: Randy Dunlap
As Reported by Randy Dunlap
When MISC_FILESYSTEMS is not enabled and NFS4.1 is:
fs/built-in.o: In function `objio_alloc_io_state':
objio_osd.c:(.text+0xcb525): undefined reference to `ore_get_rw_state'
fs/built-in.o: In function `_write_done':
objio_osd.c:(.text+0xcb58d): undefined reference to `ore_check_io'
fs/built-in.o: In function `_read_done':
...
When MISC_FILESYSTEMS, which is more of a GUI thing then anything else,
is not selected. exofs/Kconfig is never examined during Kconfig,
and it can not do it's magic stuff to automatically select everything
needed.
We must split exofs/Kconfig in two. The ore one is always included.
And the exofs one is left in it's old place in the menu.
[Needed for the 3.2.0 Kernel]
CC: Stable Tree <stable@kernel.org>
Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
---
fs/Kconfig | 2 ++
fs/exofs/Kconfig | 11 -----------
fs/exofs/Kconfig.ore | 12 ++++++++++++
3 files changed, 14 insertions(+), 11 deletions(-)
create mode 100644 fs/exofs/Kconfig.ore
diff --git a/fs/Kconfig b/fs/Kconfig
index 5f4c45d..6ad58a5 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -218,6 +218,8 @@ source "fs/exofs/Kconfig"
endif # MISC_FILESYSTEMS
+source "fs/exofs/Kconfig.ore"
+
menuconfig NETWORK_FILESYSTEMS
bool "Network File Systems"
default y
diff --git a/fs/exofs/Kconfig b/fs/exofs/Kconfig
index da42f32..86194b2 100644
--- a/fs/exofs/Kconfig
+++ b/fs/exofs/Kconfig
@@ -1,14 +1,3 @@
-# Note ORE needs to "select ASYNC_XOR". So Not to force multiple selects
-# for every ORE user we do it like this. Any user should add itself here
-# at the "depends on EXOFS_FS || ..." with an ||. The dependencies are
-# selected here, and we default to "ON". So in effect it is like been
-# selected by any of the users.
-config ORE
- tristate
- depends on EXOFS_FS || PNFS_OBJLAYOUT
- select ASYNC_XOR
- default SCSI_OSD_ULD
-
config EXOFS_FS
tristate "exofs: OSD based file system support"
depends on SCSI_OSD_ULD
diff --git a/fs/exofs/Kconfig.ore b/fs/exofs/Kconfig.ore
new file mode 100644
index 0000000..1ca7fb7
--- /dev/null
+++ b/fs/exofs/Kconfig.ore
@@ -0,0 +1,12 @@
+# ORE - Objects Raid Engine (libore.ko)
+#
+# Note ORE needs to "select ASYNC_XOR". So Not to force multiple selects
+# for every ORE user we do it like this. Any user should add itself here
+# at the "depends on EXOFS_FS || ..." with an ||. The dependencies are
+# selected here, and we default to "ON". So in effect it is like been
+# selected by any of the users.
+config ORE
+ tristate
+ depends on EXOFS_FS || PNFS_OBJLAYOUT
+ select ASYNC_XOR
+ default SCSI_OSD_ULD
--
1.7.2.3
^ permalink raw reply related [flat|nested] 8+ messages in thread
* [PATCH 2/4] ore: Fix crash in case of an IO error.
2012-01-06 14:37 [PATCHSET 0/4] ore: Kernel 3.3 BUG squashing (Also for 3.2 Stable@) Boaz Harrosh
2012-01-06 14:40 ` [PATCH 1/4] ore: FIX breakage when MISC_FILESYSTEMS is not set Boaz Harrosh
@ 2012-01-06 14:42 ` Boaz Harrosh
2012-01-06 14:43 ` [PATCH 3/4] ore: fix BUG_ON, too few sgs when reading Boaz Harrosh
2012-01-06 14:46 ` [PATCH 4/4] ore: Must support none-PAGE-aligned IO Boaz Harrosh
3 siblings, 0 replies; 8+ messages in thread
From: Boaz Harrosh @ 2012-01-06 14:42 UTC (permalink / raw)
To: linux-fsdevel, open-osd, Stable Tree
The users of ore_check_io() expect the reported device
(In case of error) to be indexed relative to the passed-in
ore_components table, and not the logical dev index.
This causes a crash inside objlayoutdriver in case of
an IO error.
[Bug in 3.2.0 Kernel]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
---
fs/exofs/ore.c | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/exofs/ore.c b/fs/exofs/ore.c
index d271ad8..894f3e1 100644
--- a/fs/exofs/ore.c
+++ b/fs/exofs/ore.c
@@ -445,10 +445,10 @@ int ore_check_io(struct ore_io_state *ios, ore_on_dev_error on_dev_error)
u64 residual = ios->reading ?
or->in.residual : or->out.residual;
u64 offset = (ios->offset + ios->length) - residual;
- struct ore_dev *od = ios->oc->ods[
- per_dev->dev - ios->oc->first_dev];
+ unsigned dev = per_dev->dev - ios->oc->first_dev;
+ struct ore_dev *od = ios->oc->ods[dev];
- on_dev_error(ios, od, per_dev->dev, osi.osd_err_pri,
+ on_dev_error(ios, od, dev, osi.osd_err_pri,
offset, residual);
}
if (osi.osd_err_pri >= acumulated_osd_err) {
--
1.7.2.3
^ permalink raw reply related [flat|nested] 8+ messages in thread
* [PATCH 3/4] ore: fix BUG_ON, too few sgs when reading
2012-01-06 14:37 [PATCHSET 0/4] ore: Kernel 3.3 BUG squashing (Also for 3.2 Stable@) Boaz Harrosh
2012-01-06 14:40 ` [PATCH 1/4] ore: FIX breakage when MISC_FILESYSTEMS is not set Boaz Harrosh
2012-01-06 14:42 ` [PATCH 2/4] ore: Fix crash in case of an IO error Boaz Harrosh
@ 2012-01-06 14:43 ` Boaz Harrosh
2012-01-06 14:46 ` [PATCH 4/4] ore: Must support none-PAGE-aligned IO Boaz Harrosh
3 siblings, 0 replies; 8+ messages in thread
From: Boaz Harrosh @ 2012-01-06 14:43 UTC (permalink / raw)
To: linux-fsdevel, open-osd, Stable Tree
When reading RAID5 files, in rare cases, we calculated too
few sg segments. There should be two extra for the beginning
and end partial units.
Also "too few sg segments" should not be a BUG_ON there is
all the mechanics in place to handle it, as a short read.
So just return -ENOMEM and the rest of the code will gracefully
split the IO.
[Bug in 3.2.0 Kernel]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
---
fs/exofs/ore.c | 2 +-
fs/exofs/ore_raid.c | 6 +++++-
2 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/fs/exofs/ore.c b/fs/exofs/ore.c
index 894f3e1..49cf230 100644
--- a/fs/exofs/ore.c
+++ b/fs/exofs/ore.c
@@ -266,7 +266,7 @@ int ore_get_rw_state(struct ore_layout *layout, struct ore_components *oc,
/* first/last seg is split */
num_raid_units += layout->group_width;
- sgs_per_dev = div_u64(num_raid_units, data_devs);
+ sgs_per_dev = div_u64(num_raid_units, data_devs) + 2;
} else {
/* For Writes add parity pages array. */
max_par_pages = num_raid_units * pages_in_unit *
diff --git a/fs/exofs/ore_raid.c b/fs/exofs/ore_raid.c
index 29c47e5..414a2df 100644
--- a/fs/exofs/ore_raid.c
+++ b/fs/exofs/ore_raid.c
@@ -551,7 +551,11 @@ int _ore_add_parity_unit(struct ore_io_state *ios,
unsigned cur_len)
{
if (ios->reading) {
- BUG_ON(per_dev->cur_sg >= ios->sgs_per_dev);
+ if (per_dev->cur_sg >= ios->sgs_per_dev) {
+ ORE_DBGMSG("cur_sg(%d) >= sgs_per_dev(%d)\n" ,
+ per_dev->cur_sg, ios->sgs_per_dev);
+ return -ENOMEM;
+ }
_ore_add_sg_seg(per_dev, cur_len, true);
} else {
struct __stripe_pages_2d *sp2d = ios->sp2d;
--
1.7.2.3
^ permalink raw reply related [flat|nested] 8+ messages in thread
* [PATCH 4/4] ore: Must support none-PAGE-aligned IO
2012-01-06 14:37 [PATCHSET 0/4] ore: Kernel 3.3 BUG squashing (Also for 3.2 Stable@) Boaz Harrosh
` (2 preceding siblings ...)
2012-01-06 14:43 ` [PATCH 3/4] ore: fix BUG_ON, too few sgs when reading Boaz Harrosh
@ 2012-01-06 14:46 ` Boaz Harrosh
2012-01-08 8:50 ` [PATCH 4/4 ver2] " Boaz Harrosh
2012-01-08 8:50 ` [PATCH 4/4] " Boaz Harrosh
3 siblings, 2 replies; 8+ messages in thread
From: Boaz Harrosh @ 2012-01-06 14:46 UTC (permalink / raw)
To: linux-fsdevel, open-osd, Stable Tree
NFS might send us offsets that are not PAGE aligned. So
we must read in the reminder of the first/last pages, in cases
we need it for Parity calculations.
We only add an sg segments to read the partial page. But
we don't mark it as read=true because it is a lock-for-write
page.
TODO: In some cases (IO spans a single unit) we can just
adjust the raid_unit offset/length, but this is left for
later Kernels.
[Bug in 3.2.0 Kernel]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
---
fs/exofs/ore_raid.c | 71 ++++++++++++++++++++++++++++++++++++++++++--------
1 files changed, 59 insertions(+), 12 deletions(-)
diff --git a/fs/exofs/ore_raid.c b/fs/exofs/ore_raid.c
index 414a2df..b3047ef 100644
--- a/fs/exofs/ore_raid.c
+++ b/fs/exofs/ore_raid.c
@@ -328,8 +328,8 @@ static int _alloc_read_4_write(struct ore_io_state *ios)
/* @si contains info of the to-be-inserted page. Update of @si should be
* maintained by caller. Specificaly si->dev, si->obj_offset, ...
*/
-static int _add_to_read_4_write(struct ore_io_state *ios,
- struct ore_striping_info *si, struct page *page)
+static int _add_to_r4w(struct ore_io_state *ios, struct ore_striping_info *si,
+ struct page *page, unsigned pg_len)
{
struct request_queue *q;
struct ore_per_dev_state *per_dev;
@@ -366,17 +366,59 @@ static int _add_to_read_4_write(struct ore_io_state *ios,
_ore_add_sg_seg(per_dev, gap, true);
}
q = osd_request_queue(ore_comp_dev(read_ios->oc, per_dev->dev));
- added_len = bio_add_pc_page(q, per_dev->bio, page, PAGE_SIZE, 0);
- if (unlikely(added_len != PAGE_SIZE)) {
+ added_len = bio_add_pc_page(q, per_dev->bio, page, pg_len, 0);
+ if (unlikely(added_len != pg_len)) {
ORE_DBGMSG("Failed to bio_add_pc_page bi_vcnt=%d\n",
per_dev->bio->bi_vcnt);
return -ENOMEM;
}
- per_dev->length += PAGE_SIZE;
+ per_dev->length += pg_len;
return 0;
}
+/* read the beginning of an unaligned first page */
+static int _add_to_r4w_first_page(struct ore_io_state *ios, struct page *page)
+{
+ struct ore_striping_info si;
+ unsigned pg_len;
+
+ ore_calc_stripe_info(ios->layout, ios->offset, 0, &si);
+
+ pg_len = si.obj_offset % PAGE_SIZE;
+ si.obj_offset -= pg_len;
+
+ ORE_DBGMSG("offset=0x%llx len=0x%x index=0x%lx dev=%x\n",
+ _LLU(si.obj_offset), pg_len, page->index, si.dev);
+
+ return _add_to_r4w(ios, &si, page, pg_len);
+}
+
+/* read the end of an incomplete last page */
+static int _add_to_r4w_last_page(struct ore_io_state *ios, u64 *offset)
+{
+ struct ore_striping_info si;
+ struct page *page;
+ unsigned pg_len, p, c;
+
+ ore_calc_stripe_info(ios->layout, *offset, 0, &si);
+
+ p = si.unit_off / PAGE_SIZE;
+ c = _dev_order(ios->layout->group_width * ios->layout->mirrors_p1,
+ ios->layout->mirrors_p1, si.par_dev, si.dev);
+ page = ios->sp2d->_1p_stripes[p].pages[c];
+
+ pg_len = PAGE_SIZE - (si.unit_off % PAGE_SIZE);
+ *offset += pg_len;
+
+ ORE_DBGMSG("p=%d, c=%d next-offset=0x%llx len=0x%x dev=%x par_dev=%d\n",
+ p, c, _LLU(*offset), pg_len, si.dev, si.par_dev);
+
+ BUG_ON(!page);
+
+ return _add_to_r4w(ios, &si, page, pg_len);
+}
+
static void _mark_read4write_pages_uptodate(struct ore_io_state *ios, int ret)
{
struct bio_vec *bv;
@@ -444,9 +486,13 @@ static int _read_4_write(struct ore_io_state *ios)
struct page **pp = &_1ps->pages[c];
bool uptodate;
- if (*pp)
+ if (*pp) {
+ if (ios->offset % PAGE_SIZE)
+ /* Read the remainder of the page */
+ _add_to_r4w_first_page(ios, *pp);
/* to-be-written pages start here */
goto read_last_stripe;
+ }
*pp = ios->r4w->get_page(ios->private, offset,
&uptodate);
@@ -454,7 +500,7 @@ static int _read_4_write(struct ore_io_state *ios)
return -ENOMEM;
if (!uptodate)
- _add_to_read_4_write(ios, &read_si, *pp);
+ _add_to_r4w(ios, &read_si, *pp, PAGE_SIZE);
/* Mark read-pages to be cache_released */
_1ps->page_is_read[c] = true;
@@ -465,8 +511,11 @@ static int _read_4_write(struct ore_io_state *ios)
}
read_last_stripe:
- offset = ios->offset + (ios->length + PAGE_SIZE - 1) /
- PAGE_SIZE * PAGE_SIZE;
+ offset = ios->offset + ios->length;
+ if (offset % PAGE_SIZE)
+ _add_to_r4w_last_page(ios, &offset);
+ /* offset will be aligned to next page */
+
last_stripe_end = div_u64(offset + bytes_in_stripe - 1, bytes_in_stripe)
* bytes_in_stripe;
if (offset == last_stripe_end) /* Optimize for the aligned case */
@@ -503,7 +552,7 @@ read_last_stripe:
/* Mark read-pages to be cache_released */
_1ps->page_is_read[c] = true;
if (!uptodate)
- _add_to_read_4_write(ios, &read_si, page);
+ _add_to_r4w(ios, &read_si, page, PAGE_SIZE);
}
offset += PAGE_SIZE;
@@ -616,8 +665,6 @@ int _ore_post_alloc_raid_stuff(struct ore_io_state *ios)
return -ENOMEM;
}
- BUG_ON(ios->offset % PAGE_SIZE);
-
/* Round io down to last full strip */
first_stripe = div_u64(ios->offset, stripe_size);
last_stripe = div_u64(ios->offset + ios->length, stripe_size);
--
1.7.2.3
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH 1/4] ore: FIX breakage when MISC_FILESYSTEMS is not set
2012-01-06 14:40 ` [PATCH 1/4] ore: FIX breakage when MISC_FILESYSTEMS is not set Boaz Harrosh
@ 2012-01-07 18:19 ` Randy Dunlap
0 siblings, 0 replies; 8+ messages in thread
From: Randy Dunlap @ 2012-01-07 18:19 UTC (permalink / raw)
To: Boaz Harrosh; +Cc: linux-fsdevel, open-osd, Stable Tree
On 01/06/2012 06:40 AM, Boaz Harrosh wrote:
>
> As Reported by Randy Dunlap
>
> When MISC_FILESYSTEMS is not enabled and NFS4.1 is:
>
> fs/built-in.o: In function `objio_alloc_io_state':
> objio_osd.c:(.text+0xcb525): undefined reference to `ore_get_rw_state'
> fs/built-in.o: In function `_write_done':
> objio_osd.c:(.text+0xcb58d): undefined reference to `ore_check_io'
> fs/built-in.o: In function `_read_done':
> ...
>
> When MISC_FILESYSTEMS, which is more of a GUI thing then anything else,
> is not selected. exofs/Kconfig is never examined during Kconfig,
> and it can not do it's magic stuff to automatically select everything
> needed.
>
> We must split exofs/Kconfig in two. The ore one is always included.
> And the exofs one is left in it's old place in the menu.
>
> [Needed for the 3.2.0 Kernel]
> CC: Stable Tree <stable@kernel.org>
> Reported-by: Randy Dunlap <rdunlap@xenotime.net>
> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Thanks.
> ---
> fs/Kconfig | 2 ++
> fs/exofs/Kconfig | 11 -----------
> fs/exofs/Kconfig.ore | 12 ++++++++++++
> 3 files changed, 14 insertions(+), 11 deletions(-)
> create mode 100644 fs/exofs/Kconfig.ore
>
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 5f4c45d..6ad58a5 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -218,6 +218,8 @@ source "fs/exofs/Kconfig"
>
> endif # MISC_FILESYSTEMS
>
> +source "fs/exofs/Kconfig.ore"
> +
> menuconfig NETWORK_FILESYSTEMS
> bool "Network File Systems"
> default y
> diff --git a/fs/exofs/Kconfig b/fs/exofs/Kconfig
> index da42f32..86194b2 100644
> --- a/fs/exofs/Kconfig
> +++ b/fs/exofs/Kconfig
> @@ -1,14 +1,3 @@
> -# Note ORE needs to "select ASYNC_XOR". So Not to force multiple selects
> -# for every ORE user we do it like this. Any user should add itself here
> -# at the "depends on EXOFS_FS || ..." with an ||. The dependencies are
> -# selected here, and we default to "ON". So in effect it is like been
> -# selected by any of the users.
> -config ORE
> - tristate
> - depends on EXOFS_FS || PNFS_OBJLAYOUT
> - select ASYNC_XOR
> - default SCSI_OSD_ULD
> -
> config EXOFS_FS
> tristate "exofs: OSD based file system support"
> depends on SCSI_OSD_ULD
> diff --git a/fs/exofs/Kconfig.ore b/fs/exofs/Kconfig.ore
> new file mode 100644
> index 0000000..1ca7fb7
> --- /dev/null
> +++ b/fs/exofs/Kconfig.ore
> @@ -0,0 +1,12 @@
> +# ORE - Objects Raid Engine (libore.ko)
> +#
> +# Note ORE needs to "select ASYNC_XOR". So Not to force multiple selects
> +# for every ORE user we do it like this. Any user should add itself here
> +# at the "depends on EXOFS_FS || ..." with an ||. The dependencies are
> +# selected here, and we default to "ON". So in effect it is like been
> +# selected by any of the users.
> +config ORE
> + tristate
> + depends on EXOFS_FS || PNFS_OBJLAYOUT
> + select ASYNC_XOR
> + default SCSI_OSD_ULD
--
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH 4/4 ver2] ore: Must support none-PAGE-aligned IO
2012-01-06 14:46 ` [PATCH 4/4] ore: Must support none-PAGE-aligned IO Boaz Harrosh
@ 2012-01-08 8:50 ` Boaz Harrosh
2012-01-08 8:50 ` [PATCH 4/4] " Boaz Harrosh
1 sibling, 0 replies; 8+ messages in thread
From: Boaz Harrosh @ 2012-01-08 8:50 UTC (permalink / raw)
To: linux-fsdevel, open-osd, Stable Tree
NFS might send us offsets that are not PAGE aligned. So
we must read in the reminder of the first/last pages, in cases
we need it for Parity calculations.
We only add an sg segments to read the partial page. But
we don't mark it as read=true because it is a lock-for-write
page.
TODO: In some cases (IO spans a single unit) we can just
adjust the raid_unit offset/length, but this is left for
later Kernels.
[Bug in 3.2.0 Kernel]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
---
fs/exofs/ore_raid.c | 72 ++++++++++++++++++++++++++++++++++++++++++--------
1 files changed, 60 insertions(+), 12 deletions(-)
diff --git a/fs/exofs/ore_raid.c b/fs/exofs/ore_raid.c
index 414a2df..d222c77 100644
--- a/fs/exofs/ore_raid.c
+++ b/fs/exofs/ore_raid.c
@@ -328,8 +328,8 @@ static int _alloc_read_4_write(struct ore_io_state *ios)
/* @si contains info of the to-be-inserted page. Update of @si should be
* maintained by caller. Specificaly si->dev, si->obj_offset, ...
*/
-static int _add_to_read_4_write(struct ore_io_state *ios,
- struct ore_striping_info *si, struct page *page)
+static int _add_to_r4w(struct ore_io_state *ios, struct ore_striping_info *si,
+ struct page *page, unsigned pg_len)
{
struct request_queue *q;
struct ore_per_dev_state *per_dev;
@@ -366,17 +366,60 @@ static int _add_to_read_4_write(struct ore_io_state *ios,
_ore_add_sg_seg(per_dev, gap, true);
}
q = osd_request_queue(ore_comp_dev(read_ios->oc, per_dev->dev));
- added_len = bio_add_pc_page(q, per_dev->bio, page, PAGE_SIZE, 0);
- if (unlikely(added_len != PAGE_SIZE)) {
+ added_len = bio_add_pc_page(q, per_dev->bio, page, pg_len,
+ si->obj_offset % PAGE_SIZE);
+ if (unlikely(added_len != pg_len)) {
ORE_DBGMSG("Failed to bio_add_pc_page bi_vcnt=%d\n",
per_dev->bio->bi_vcnt);
return -ENOMEM;
}
- per_dev->length += PAGE_SIZE;
+ per_dev->length += pg_len;
return 0;
}
+/* read the beginning of an unaligned first page */
+static int _add_to_r4w_first_page(struct ore_io_state *ios, struct page *page)
+{
+ struct ore_striping_info si;
+ unsigned pg_len;
+
+ ore_calc_stripe_info(ios->layout, ios->offset, 0, &si);
+
+ pg_len = si.obj_offset % PAGE_SIZE;
+ si.obj_offset -= pg_len;
+
+ ORE_DBGMSG("offset=0x%llx len=0x%x index=0x%lx dev=%x\n",
+ _LLU(si.obj_offset), pg_len, page->index, si.dev);
+
+ return _add_to_r4w(ios, &si, page, pg_len);
+}
+
+/* read the end of an incomplete last page */
+static int _add_to_r4w_last_page(struct ore_io_state *ios, u64 *offset)
+{
+ struct ore_striping_info si;
+ struct page *page;
+ unsigned pg_len, p, c;
+
+ ore_calc_stripe_info(ios->layout, *offset, 0, &si);
+
+ p = si.unit_off / PAGE_SIZE;
+ c = _dev_order(ios->layout->group_width * ios->layout->mirrors_p1,
+ ios->layout->mirrors_p1, si.par_dev, si.dev);
+ page = ios->sp2d->_1p_stripes[p].pages[c];
+
+ pg_len = PAGE_SIZE - (si.unit_off % PAGE_SIZE);
+ *offset += pg_len;
+
+ ORE_DBGMSG("p=%d, c=%d next-offset=0x%llx len=0x%x dev=%x par_dev=%d\n",
+ p, c, _LLU(*offset), pg_len, si.dev, si.par_dev);
+
+ BUG_ON(!page);
+
+ return _add_to_r4w(ios, &si, page, pg_len);
+}
+
static void _mark_read4write_pages_uptodate(struct ore_io_state *ios, int ret)
{
struct bio_vec *bv;
@@ -444,9 +487,13 @@ static int _read_4_write(struct ore_io_state *ios)
struct page **pp = &_1ps->pages[c];
bool uptodate;
- if (*pp)
+ if (*pp) {
+ if (ios->offset % PAGE_SIZE)
+ /* Read the remainder of the page */
+ _add_to_r4w_first_page(ios, *pp);
/* to-be-written pages start here */
goto read_last_stripe;
+ }
*pp = ios->r4w->get_page(ios->private, offset,
&uptodate);
@@ -454,7 +501,7 @@ static int _read_4_write(struct ore_io_state *ios)
return -ENOMEM;
if (!uptodate)
- _add_to_read_4_write(ios, &read_si, *pp);
+ _add_to_r4w(ios, &read_si, *pp, PAGE_SIZE);
/* Mark read-pages to be cache_released */
_1ps->page_is_read[c] = true;
@@ -465,8 +512,11 @@ static int _read_4_write(struct ore_io_state *ios)
}
read_last_stripe:
- offset = ios->offset + (ios->length + PAGE_SIZE - 1) /
- PAGE_SIZE * PAGE_SIZE;
+ offset = ios->offset + ios->length;
+ if (offset % PAGE_SIZE)
+ _add_to_r4w_last_page(ios, &offset);
+ /* offset will be aligned to next page */
+
last_stripe_end = div_u64(offset + bytes_in_stripe - 1, bytes_in_stripe)
* bytes_in_stripe;
if (offset == last_stripe_end) /* Optimize for the aligned case */
@@ -503,7 +553,7 @@ read_last_stripe:
/* Mark read-pages to be cache_released */
_1ps->page_is_read[c] = true;
if (!uptodate)
- _add_to_read_4_write(ios, &read_si, page);
+ _add_to_r4w(ios, &read_si, page, PAGE_SIZE);
}
offset += PAGE_SIZE;
@@ -616,8 +666,6 @@ int _ore_post_alloc_raid_stuff(struct ore_io_state *ios)
return -ENOMEM;
}
- BUG_ON(ios->offset % PAGE_SIZE);
-
/* Round io down to last full strip */
first_stripe = div_u64(ios->offset, stripe_size);
last_stripe = div_u64(ios->offset + ios->length, stripe_size);
--
1.7.2.3
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH 4/4] ore: Must support none-PAGE-aligned IO
2012-01-06 14:46 ` [PATCH 4/4] ore: Must support none-PAGE-aligned IO Boaz Harrosh
2012-01-08 8:50 ` [PATCH 4/4 ver2] " Boaz Harrosh
@ 2012-01-08 8:50 ` Boaz Harrosh
1 sibling, 0 replies; 8+ messages in thread
From: Boaz Harrosh @ 2012-01-08 8:50 UTC (permalink / raw)
To: linux-fsdevel, open-osd, Stable Tree
On 01/06/2012 04:46 PM, Boaz Harrosh wrote:
>
> NFS might send us offsets that are not PAGE aligned. So
> we must read in the reminder of the first/last pages, in cases
> we need it for Parity calculations.
>
> We only add an sg segments to read the partial page. But
> we don't mark it as read=true because it is a lock-for-write
> page.
>
> TODO: In some cases (IO spans a single unit) we can just
> adjust the raid_unit offset/length, but this is left for
> later Kernels.
>
> [Bug in 3.2.0 Kernel]
> CC: Stable Tree <stable@kernel.org>
> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
This patch had a data corruption bug. I'll post a version 2
Here is the diff of ver2 from ver1
---
diff --git a/fs/exofs/ore_raid.c b/fs/exofs/ore_raid.c
index b3047ef..d222c77 100644
--- a/fs/exofs/ore_raid.c
+++ b/fs/exofs/ore_raid.c
@@ -366,7 +366,8 @@ static int _add_to_r4w(struct ore_io_state *ios, struct ore_striping_info *si,
_ore_add_sg_seg(per_dev, gap, true);
}
q = osd_request_queue(ore_comp_dev(read_ios->oc, per_dev->dev));
- added_len = bio_add_pc_page(q, per_dev->bio, page, pg_len, 0);
+ added_len = bio_add_pc_page(q, per_dev->bio, page, pg_len,
+ si->obj_offset % PAGE_SIZE);
if (unlikely(added_len != pg_len)) {
ORE_DBGMSG("Failed to bio_add_pc_page bi_vcnt=%d\n",
per_dev->bio->bi_vcnt);
> ---
> fs/exofs/ore_raid.c | 71 ++++++++++++++++++++++++++++++++++++++++++--------
> 1 files changed, 59 insertions(+), 12 deletions(-)
>
> diff --git a/fs/exofs/ore_raid.c b/fs/exofs/ore_raid.c
> index 414a2df..b3047ef 100644
> --- a/fs/exofs/ore_raid.c
> +++ b/fs/exofs/ore_raid.c
> @@ -328,8 +328,8 @@ static int _alloc_read_4_write(struct ore_io_state *ios)
> /* @si contains info of the to-be-inserted page. Update of @si should be
> * maintained by caller. Specificaly si->dev, si->obj_offset, ...
> */
> -static int _add_to_read_4_write(struct ore_io_state *ios,
> - struct ore_striping_info *si, struct page *page)
> +static int _add_to_r4w(struct ore_io_state *ios, struct ore_striping_info *si,
> + struct page *page, unsigned pg_len)
> {
> struct request_queue *q;
> struct ore_per_dev_state *per_dev;
> @@ -366,17 +366,59 @@ static int _add_to_read_4_write(struct ore_io_state *ios,
> _ore_add_sg_seg(per_dev, gap, true);
> }
> q = osd_request_queue(ore_comp_dev(read_ios->oc, per_dev->dev));
> - added_len = bio_add_pc_page(q, per_dev->bio, page, PAGE_SIZE, 0);
> - if (unlikely(added_len != PAGE_SIZE)) {
> + added_len = bio_add_pc_page(q, per_dev->bio, page, pg_len, 0);
> + if (unlikely(added_len != pg_len)) {
> ORE_DBGMSG("Failed to bio_add_pc_page bi_vcnt=%d\n",
> per_dev->bio->bi_vcnt);
> return -ENOMEM;
> }
>
> - per_dev->length += PAGE_SIZE;
> + per_dev->length += pg_len;
> return 0;
> }
>
> +/* read the beginning of an unaligned first page */
> +static int _add_to_r4w_first_page(struct ore_io_state *ios, struct page *page)
> +{
> + struct ore_striping_info si;
> + unsigned pg_len;
> +
> + ore_calc_stripe_info(ios->layout, ios->offset, 0, &si);
> +
> + pg_len = si.obj_offset % PAGE_SIZE;
> + si.obj_offset -= pg_len;
> +
> + ORE_DBGMSG("offset=0x%llx len=0x%x index=0x%lx dev=%x\n",
> + _LLU(si.obj_offset), pg_len, page->index, si.dev);
> +
> + return _add_to_r4w(ios, &si, page, pg_len);
> +}
> +
> +/* read the end of an incomplete last page */
> +static int _add_to_r4w_last_page(struct ore_io_state *ios, u64 *offset)
> +{
> + struct ore_striping_info si;
> + struct page *page;
> + unsigned pg_len, p, c;
> +
> + ore_calc_stripe_info(ios->layout, *offset, 0, &si);
> +
> + p = si.unit_off / PAGE_SIZE;
> + c = _dev_order(ios->layout->group_width * ios->layout->mirrors_p1,
> + ios->layout->mirrors_p1, si.par_dev, si.dev);
> + page = ios->sp2d->_1p_stripes[p].pages[c];
> +
> + pg_len = PAGE_SIZE - (si.unit_off % PAGE_SIZE);
> + *offset += pg_len;
> +
> + ORE_DBGMSG("p=%d, c=%d next-offset=0x%llx len=0x%x dev=%x par_dev=%d\n",
> + p, c, _LLU(*offset), pg_len, si.dev, si.par_dev);
> +
> + BUG_ON(!page);
> +
> + return _add_to_r4w(ios, &si, page, pg_len);
> +}
> +
> static void _mark_read4write_pages_uptodate(struct ore_io_state *ios, int ret)
> {
> struct bio_vec *bv;
> @@ -444,9 +486,13 @@ static int _read_4_write(struct ore_io_state *ios)
> struct page **pp = &_1ps->pages[c];
> bool uptodate;
>
> - if (*pp)
> + if (*pp) {
> + if (ios->offset % PAGE_SIZE)
> + /* Read the remainder of the page */
> + _add_to_r4w_first_page(ios, *pp);
> /* to-be-written pages start here */
> goto read_last_stripe;
> + }
>
> *pp = ios->r4w->get_page(ios->private, offset,
> &uptodate);
> @@ -454,7 +500,7 @@ static int _read_4_write(struct ore_io_state *ios)
> return -ENOMEM;
>
> if (!uptodate)
> - _add_to_read_4_write(ios, &read_si, *pp);
> + _add_to_r4w(ios, &read_si, *pp, PAGE_SIZE);
>
> /* Mark read-pages to be cache_released */
> _1ps->page_is_read[c] = true;
> @@ -465,8 +511,11 @@ static int _read_4_write(struct ore_io_state *ios)
> }
>
> read_last_stripe:
> - offset = ios->offset + (ios->length + PAGE_SIZE - 1) /
> - PAGE_SIZE * PAGE_SIZE;
> + offset = ios->offset + ios->length;
> + if (offset % PAGE_SIZE)
> + _add_to_r4w_last_page(ios, &offset);
> + /* offset will be aligned to next page */
> +
> last_stripe_end = div_u64(offset + bytes_in_stripe - 1, bytes_in_stripe)
> * bytes_in_stripe;
> if (offset == last_stripe_end) /* Optimize for the aligned case */
> @@ -503,7 +552,7 @@ read_last_stripe:
> /* Mark read-pages to be cache_released */
> _1ps->page_is_read[c] = true;
> if (!uptodate)
> - _add_to_read_4_write(ios, &read_si, page);
> + _add_to_r4w(ios, &read_si, page, PAGE_SIZE);
> }
>
> offset += PAGE_SIZE;
> @@ -616,8 +665,6 @@ int _ore_post_alloc_raid_stuff(struct ore_io_state *ios)
> return -ENOMEM;
> }
>
> - BUG_ON(ios->offset % PAGE_SIZE);
> -
> /* Round io down to last full strip */
> first_stripe = div_u64(ios->offset, stripe_size);
> last_stripe = div_u64(ios->offset + ios->length, stripe_size);
^ permalink raw reply related [flat|nested] 8+ messages in thread
end of thread, other threads:[~2012-01-08 8:50 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-01-06 14:37 [PATCHSET 0/4] ore: Kernel 3.3 BUG squashing (Also for 3.2 Stable@) Boaz Harrosh
2012-01-06 14:40 ` [PATCH 1/4] ore: FIX breakage when MISC_FILESYSTEMS is not set Boaz Harrosh
2012-01-07 18:19 ` Randy Dunlap
2012-01-06 14:42 ` [PATCH 2/4] ore: Fix crash in case of an IO error Boaz Harrosh
2012-01-06 14:43 ` [PATCH 3/4] ore: fix BUG_ON, too few sgs when reading Boaz Harrosh
2012-01-06 14:46 ` [PATCH 4/4] ore: Must support none-PAGE-aligned IO Boaz Harrosh
2012-01-08 8:50 ` [PATCH 4/4 ver2] " Boaz Harrosh
2012-01-08 8:50 ` [PATCH 4/4] " Boaz Harrosh
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).