* [PATCHSET 0/4] ore: Kernel 3.3 BUG squashing (Also for 3.2 Stable@)
From: Boaz Harrosh @ 2012-01-06 14:37 UTC
To: linux-fsdevel, open-osd, Stable Tree; +Cc: Randy Dunlap

October's extensive testing has unearthed some nasty bugs. Due to my own
negligence I failed to send the fixes in time for the 3.2 Kernel, so here
they are for the 3.3 merge window, and also for the Stable@ tree. Sorry!

[PATCH 1/4] ore: FIX breakage when MISC_FILESYSTEMS is not set
	Random Kconfig breakage found by Randy Dunlap. Thanks Randy.

[PATCH 2/4] ore: Fix crash in case of an IO error.
[PATCH 3/4] ore: fix BUG_ON, too few sgs when reading
[PATCH 4/4] ore: Must support none-PAGE-aligned IO
	All of these are BUG_ON(s) and crashes. The most interesting is the
	last one, which proves that the 3.2 RAID engine was a bit premature.

After this code both exofs and pNFS pass tests, including RAID5 verification.
Cheers! (exofs did so before; pNFS didn't.)

Thanks
Boaz
* [PATCH 1/4] ore: FIX breakage when MISC_FILESYSTEMS is not set
From: Boaz Harrosh @ 2012-01-06 14:40 UTC
To: linux-fsdevel, open-osd, Stable Tree; +Cc: Randy Dunlap

As reported by Randy Dunlap, when MISC_FILESYSTEMS is not enabled and
NFS4.1 is, the link fails with:

fs/built-in.o: In function `objio_alloc_io_state':
objio_osd.c:(.text+0xcb525): undefined reference to `ore_get_rw_state'
fs/built-in.o: In function `_write_done':
objio_osd.c:(.text+0xcb58d): undefined reference to `ore_check_io'
fs/built-in.o: In function `_read_done':
...

When MISC_FILESYSTEMS, which is more of a GUI thing than anything else,
is not selected, exofs/Kconfig is never examined during Kconfig
processing, and it cannot do its magic to automatically select
everything needed.

We must split exofs/Kconfig in two. The ore one is always included,
and the exofs one is left in its old place in the menu.

[Needed for the 3.2.0 Kernel]
CC: Stable Tree <stable@kernel.org>
Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
---
 fs/Kconfig           |    2 ++
 fs/exofs/Kconfig     |   11 -----------
 fs/exofs/Kconfig.ore |   12 ++++++++++++
 3 files changed, 14 insertions(+), 11 deletions(-)
 create mode 100644 fs/exofs/Kconfig.ore

diff --git a/fs/Kconfig b/fs/Kconfig
index 5f4c45d..6ad58a5 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -218,6 +218,8 @@ source "fs/exofs/Kconfig"
 
 endif # MISC_FILESYSTEMS
 
+source "fs/exofs/Kconfig.ore"
+
 menuconfig NETWORK_FILESYSTEMS
 	bool "Network File Systems"
 	default y
diff --git a/fs/exofs/Kconfig b/fs/exofs/Kconfig
index da42f32..86194b2 100644
--- a/fs/exofs/Kconfig
+++ b/fs/exofs/Kconfig
@@ -1,14 +1,3 @@
-# Note ORE needs to "select ASYNC_XOR". So Not to force multiple selects
-# for every ORE user we do it like this. Any user should add itself here
-# at the "depends on EXOFS_FS || ..." with an ||. The dependencies are
-# selected here, and we default to "ON". So in effect it is like been
-# selected by any of the users.
-config ORE
-	tristate
-	depends on EXOFS_FS || PNFS_OBJLAYOUT
-	select ASYNC_XOR
-	default SCSI_OSD_ULD
-
 config EXOFS_FS
 	tristate "exofs: OSD based file system support"
 	depends on SCSI_OSD_ULD
diff --git a/fs/exofs/Kconfig.ore b/fs/exofs/Kconfig.ore
new file mode 100644
index 0000000..1ca7fb7
--- /dev/null
+++ b/fs/exofs/Kconfig.ore
@@ -0,0 +1,12 @@
+# ORE - Objects Raid Engine (libore.ko)
+#
+# Note ORE needs to "select ASYNC_XOR". So Not to force multiple selects
+# for every ORE user we do it like this. Any user should add itself here
+# at the "depends on EXOFS_FS || ..." with an ||. The dependencies are
+# selected here, and we default to "ON". So in effect it is like been
+# selected by any of the users.
+config ORE
+	tristate
+	depends on EXOFS_FS || PNFS_OBJLAYOUT
+	select ASYNC_XOR
+	default SCSI_OSD_ULD
-- 
1.7.2.3
* Re: [PATCH 1/4] ore: FIX breakage when MISC_FILESYSTEMS is not set
From: Randy Dunlap @ 2012-01-07 18:19 UTC
To: Boaz Harrosh; +Cc: linux-fsdevel, open-osd, Stable Tree

On 01/06/2012 06:40 AM, Boaz Harrosh wrote:
>
> As reported by Randy Dunlap, when MISC_FILESYSTEMS is not enabled and
> NFS4.1 is, the link fails with:
>
> fs/built-in.o: In function `objio_alloc_io_state':
> objio_osd.c:(.text+0xcb525): undefined reference to `ore_get_rw_state'
> fs/built-in.o: In function `_write_done':
> objio_osd.c:(.text+0xcb58d): undefined reference to `ore_check_io'
> fs/built-in.o: In function `_read_done':
> ...
>
> When MISC_FILESYSTEMS, which is more of a GUI thing than anything else,
> is not selected, exofs/Kconfig is never examined during Kconfig
> processing, and it cannot do its magic to automatically select
> everything needed.
>
> We must split exofs/Kconfig in two. The ore one is always included,
> and the exofs one is left in its old place in the menu.
>
> [Needed for the 3.2.0 Kernel]
> CC: Stable Tree <stable@kernel.org>
> Reported-by: Randy Dunlap <rdunlap@xenotime.net>
> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

Acked-by: Randy Dunlap <rdunlap@xenotime.net>

Thanks.

-- 
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
* [PATCH 2/4] ore: Fix crash in case of an IO error.
From: Boaz Harrosh @ 2012-01-06 14:42 UTC
To: linux-fsdevel, open-osd, Stable Tree

The users of ore_check_io() expect the reported device (in case of
an error) to be indexed relative to the passed-in ore_components table,
and not by the logical dev index. Reporting the logical dev index
instead causes a crash inside objlayoutdriver in case of an IO error.

[Bug in 3.2.0 Kernel]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
---
 fs/exofs/ore.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/exofs/ore.c b/fs/exofs/ore.c
index d271ad8..894f3e1 100644
--- a/fs/exofs/ore.c
+++ b/fs/exofs/ore.c
@@ -445,10 +445,10 @@ int ore_check_io(struct ore_io_state *ios, ore_on_dev_error on_dev_error)
 			u64 residual = ios->reading ?
 					or->in.residual : or->out.residual;
 			u64 offset = (ios->offset + ios->length) - residual;
-			struct ore_dev *od = ios->oc->ods[
-					per_dev->dev - ios->oc->first_dev];
+			unsigned dev = per_dev->dev - ios->oc->first_dev;
+			struct ore_dev *od = ios->oc->ods[dev];
 
-			on_dev_error(ios, od, per_dev->dev, osi.osd_err_pri,
+			on_dev_error(ios, od, dev, osi.osd_err_pri,
 				     offset, residual);
 		}
 		if (osi.osd_err_pri >= acumulated_osd_err) {
-- 
1.7.2.3
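The index translation that this fix relies on can be sketched in plain userspace C. This is only an illustration under stated assumptions: `struct components`, `comp_index()` and `lookup_failed_dev()` are hypothetical stand-ins and are not the kernel's `ore_components` definitions. The point is that a callback receiving an index should be able to use it directly against the table it passed in.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the components table: a window of
 * numdevs devices that starts at logical device first_dev. */
struct components {
	unsigned first_dev;	/* logical index of table entry 0 */
	unsigned numdevs;	/* number of entries in ods[] */
	const char **ods;	/* per-device data, indexed 0..numdevs-1 */
};

/* Translate a logical device index into an index into c->ods[].
 * Returns -1 if the logical index falls outside the table. */
static int comp_index(const struct components *c, unsigned logical_dev)
{
	assert(c != NULL);
	if (logical_dev < c->first_dev ||
	    logical_dev - c->first_dev >= c->numdevs)
		return -1;
	return (int)(logical_dev - c->first_dev);
}

/* What an error callback wants: the entry for the failed device,
 * looked up by the table-relative index, never the logical one. */
static const char *lookup_failed_dev(const struct components *c,
				     unsigned logical_dev)
{
	int idx = comp_index(c, logical_dev);

	return idx < 0 ? NULL : c->ods[idx];
}
```

With a table that starts at logical device 4, logical device 5 maps to entry 1. Indexing the table with 5 directly would read past a three-entry table, which is the kind of out-of-range access the commit message describes.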
* [PATCH 3/4] ore: fix BUG_ON, too few sgs when reading
From: Boaz Harrosh @ 2012-01-06 14:43 UTC
To: linux-fsdevel, open-osd, Stable Tree

When reading RAID5 files, in rare cases, we calculated too
few sg segments. There should be two extra for the beginning
and end partial units.

Also, "too few sg segments" should not be a BUG_ON: all the
mechanics are in place to handle it as a short read.
So just return -ENOMEM and the rest of the code will
gracefully split the IO.

[Bug in 3.2.0 Kernel]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
---
 fs/exofs/ore.c      |    2 +-
 fs/exofs/ore_raid.c |    6 +++++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/exofs/ore.c b/fs/exofs/ore.c
index 894f3e1..49cf230 100644
--- a/fs/exofs/ore.c
+++ b/fs/exofs/ore.c
@@ -266,7 +266,7 @@ int ore_get_rw_state(struct ore_layout *layout, struct ore_components *oc,
 
 			/* first/last seg is split */
 			num_raid_units += layout->group_width;
-			sgs_per_dev = div_u64(num_raid_units, data_devs);
+			sgs_per_dev = div_u64(num_raid_units, data_devs) + 2;
 		} else {
 			/* For Writes add parity pages array. */
 			max_par_pages = num_raid_units * pages_in_unit *
diff --git a/fs/exofs/ore_raid.c b/fs/exofs/ore_raid.c
index 29c47e5..414a2df 100644
--- a/fs/exofs/ore_raid.c
+++ b/fs/exofs/ore_raid.c
@@ -551,7 +551,11 @@ int _ore_add_parity_unit(struct ore_io_state *ios,
 			 unsigned cur_len)
 {
 	if (ios->reading) {
-		BUG_ON(per_dev->cur_sg >= ios->sgs_per_dev);
+		if (per_dev->cur_sg >= ios->sgs_per_dev) {
+			ORE_DBGMSG("cur_sg(%d) >= sgs_per_dev(%d)\n" ,
+				per_dev->cur_sg, ios->sgs_per_dev);
+			return -ENOMEM;
+		}
 		_ore_add_sg_seg(per_dev, cur_len, true);
 	} else {
 		struct __stripe_pages_2d *sp2d = ios->sp2d;
-- 
1.7.2.3
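The two ideas in this patch, a segment budget with headroom for the partial first and last units and an error return in place of a crash, can be sketched in userspace C as follows. This is a rough model only: `struct dev_state`, `calc_sgs_per_dev()` and `add_sg_seg()` are hypothetical names, and plain division stands in for the kernel's `div_u64()`.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical per-device state: a fixed budget of sg segments. */
struct dev_state {
	unsigned cur_sg;	/* segments used so far */
	unsigned sgs_per_dev;	/* segments allocated up front */
};

/* Budget as in the patch: an even share of the raid units, plus two
 * for the partial units at the beginning and end of the IO. */
static unsigned calc_sgs_per_dev(uint64_t num_raid_units, unsigned data_devs)
{
	assert(data_devs > 0);
	return (unsigned)(num_raid_units / data_devs) + 2;
}

/* Add one segment, or report -ENOMEM so that the caller can submit a
 * shorter IO and retry the rest, instead of crashing. */
static int add_sg_seg(struct dev_state *d)
{
	if (d->cur_sg >= d->sgs_per_dev)
		return -ENOMEM;
	d->cur_sg++;
	return 0;
}
```

Returning an error here only helps if the caller already knows how to cope with a short IO, which is what the commit message says the surrounding code does.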
* [PATCH 4/4] ore: Must support none-PAGE-aligned IO
From: Boaz Harrosh @ 2012-01-06 14:46 UTC
To: linux-fsdevel, open-osd, Stable Tree

NFS might send us offsets that are not PAGE aligned. So
we must read in the remainder of the first/last pages, in case
we need it for parity calculations.

We only add an sg segment to read the partial page. But
we don't mark it as read=true because it is a lock-for-write
page.

TODO: In some cases (IO spans a single unit) we can just
adjust the raid_unit offset/length, but this is left for
later Kernels.

[Bug in 3.2.0 Kernel]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
---
 fs/exofs/ore_raid.c |   71 ++++++++++++++++++++++++++++++++++++++++++--------
 1 files changed, 59 insertions(+), 12 deletions(-)

diff --git a/fs/exofs/ore_raid.c b/fs/exofs/ore_raid.c
index 414a2df..b3047ef 100644
--- a/fs/exofs/ore_raid.c
+++ b/fs/exofs/ore_raid.c
@@ -328,8 +328,8 @@ static int _alloc_read_4_write(struct ore_io_state *ios)
 /* @si contains info of the to-be-inserted page. Update of @si should be
  * maintained by caller. Specificaly si->dev, si->obj_offset, ...
  */
-static int _add_to_read_4_write(struct ore_io_state *ios,
-				struct ore_striping_info *si, struct page *page)
+static int _add_to_r4w(struct ore_io_state *ios, struct ore_striping_info *si,
+		       struct page *page, unsigned pg_len)
 {
 	struct request_queue *q;
 	struct ore_per_dev_state *per_dev;
@@ -366,17 +366,59 @@ static int _add_to_read_4_write(struct ore_io_state *ios,
 		_ore_add_sg_seg(per_dev, gap, true);
 	}
 	q = osd_request_queue(ore_comp_dev(read_ios->oc, per_dev->dev));
-	added_len = bio_add_pc_page(q, per_dev->bio, page, PAGE_SIZE, 0);
-	if (unlikely(added_len != PAGE_SIZE)) {
+	added_len = bio_add_pc_page(q, per_dev->bio, page, pg_len, 0);
+	if (unlikely(added_len != pg_len)) {
 		ORE_DBGMSG("Failed to bio_add_pc_page bi_vcnt=%d\n",
 			   per_dev->bio->bi_vcnt);
 		return -ENOMEM;
 	}
 
-	per_dev->length += PAGE_SIZE;
+	per_dev->length += pg_len;
 	return 0;
 }
 
+/* read the beginning of an unaligned first page */
+static int _add_to_r4w_first_page(struct ore_io_state *ios, struct page *page)
+{
+	struct ore_striping_info si;
+	unsigned pg_len;
+
+	ore_calc_stripe_info(ios->layout, ios->offset, 0, &si);
+
+	pg_len = si.obj_offset % PAGE_SIZE;
+	si.obj_offset -= pg_len;
+
+	ORE_DBGMSG("offset=0x%llx len=0x%x index=0x%lx dev=%x\n",
+		   _LLU(si.obj_offset), pg_len, page->index, si.dev);
+
+	return _add_to_r4w(ios, &si, page, pg_len);
+}
+
+/* read the end of an incomplete last page */
+static int _add_to_r4w_last_page(struct ore_io_state *ios, u64 *offset)
+{
+	struct ore_striping_info si;
+	struct page *page;
+	unsigned pg_len, p, c;
+
+	ore_calc_stripe_info(ios->layout, *offset, 0, &si);
+
+	p = si.unit_off / PAGE_SIZE;
+	c = _dev_order(ios->layout->group_width * ios->layout->mirrors_p1,
+		       ios->layout->mirrors_p1, si.par_dev, si.dev);
+	page = ios->sp2d->_1p_stripes[p].pages[c];
+
+	pg_len = PAGE_SIZE - (si.unit_off % PAGE_SIZE);
+	*offset += pg_len;
+
+	ORE_DBGMSG("p=%d, c=%d next-offset=0x%llx len=0x%x dev=%x par_dev=%d\n",
+		   p, c, _LLU(*offset), pg_len, si.dev, si.par_dev);
+
+	BUG_ON(!page);
+
+	return _add_to_r4w(ios, &si, page, pg_len);
+}
+
 static void _mark_read4write_pages_uptodate(struct ore_io_state *ios, int ret)
 {
 	struct bio_vec *bv;
@@ -444,9 +486,13 @@ static int _read_4_write(struct ore_io_state *ios)
 			struct page **pp = &_1ps->pages[c];
 			bool uptodate;
 
-			if (*pp)
+			if (*pp) {
+				if (ios->offset % PAGE_SIZE)
+					/* Read the remainder of the page */
+					_add_to_r4w_first_page(ios, *pp);
 				/* to-be-written pages start here */
 				goto read_last_stripe;
+			}
 
 			*pp = ios->r4w->get_page(ios->private, offset,
 						 &uptodate);
@@ -454,7 +500,7 @@ static int _read_4_write(struct ore_io_state *ios)
 				return -ENOMEM;
 
 			if (!uptodate)
-				_add_to_read_4_write(ios, &read_si, *pp);
+				_add_to_r4w(ios, &read_si, *pp, PAGE_SIZE);
 
 			/* Mark read-pages to be cache_released */
 			_1ps->page_is_read[c] = true;
@@ -465,8 +511,11 @@ static int _read_4_write(struct ore_io_state *ios)
 	}
 
 read_last_stripe:
-	offset = ios->offset + (ios->length + PAGE_SIZE - 1) /
-				PAGE_SIZE * PAGE_SIZE;
+	offset = ios->offset + ios->length;
+	if (offset % PAGE_SIZE)
+		_add_to_r4w_last_page(ios, &offset);
+	/* offset will be aligned to next page */
+
 	last_stripe_end = div_u64(offset + bytes_in_stripe - 1, bytes_in_stripe)
 				 * bytes_in_stripe;
 	if (offset == last_stripe_end) /* Optimize for the aligned case */
@@ -503,7 +552,7 @@ read_last_stripe:
 			/* Mark read-pages to be cache_released */
 			_1ps->page_is_read[c] = true;
 			if (!uptodate)
-				_add_to_read_4_write(ios, &read_si, page);
+				_add_to_r4w(ios, &read_si, page, PAGE_SIZE);
 		}
 
 		offset += PAGE_SIZE;
@@ -616,8 +665,6 @@ int _ore_post_alloc_raid_stuff(struct ore_io_state *ios)
 			return -ENOMEM;
 		}
 
-		BUG_ON(ios->offset % PAGE_SIZE);
-
 		/* Round io down to last full strip */
 		first_stripe = div_u64(ios->offset, stripe_size);
 		last_stripe = div_u64(ios->offset + ios->length, stripe_size);
-- 
1.7.2.3
* [PATCH 4/4 ver2] ore: Must support none-PAGE-aligned IO
From: Boaz Harrosh @ 2012-01-08 8:50 UTC
To: linux-fsdevel, open-osd, Stable Tree

NFS might send us offsets that are not PAGE aligned. So
we must read in the remainder of the first/last pages, in case
we need it for parity calculations.

We only add an sg segment to read the partial page. But
we don't mark it as read=true because it is a lock-for-write
page.

TODO: In some cases (IO spans a single unit) we can just
adjust the raid_unit offset/length, but this is left for
later Kernels.

[Bug in 3.2.0 Kernel]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
---
 fs/exofs/ore_raid.c |   72 ++++++++++++++++++++++++++++++++++++++++++--------
 1 files changed, 60 insertions(+), 12 deletions(-)

diff --git a/fs/exofs/ore_raid.c b/fs/exofs/ore_raid.c
index 414a2df..d222c77 100644
--- a/fs/exofs/ore_raid.c
+++ b/fs/exofs/ore_raid.c
@@ -328,8 +328,8 @@ static int _alloc_read_4_write(struct ore_io_state *ios)
 /* @si contains info of the to-be-inserted page. Update of @si should be
  * maintained by caller. Specificaly si->dev, si->obj_offset, ...
  */
-static int _add_to_read_4_write(struct ore_io_state *ios,
-				struct ore_striping_info *si, struct page *page)
+static int _add_to_r4w(struct ore_io_state *ios, struct ore_striping_info *si,
+		       struct page *page, unsigned pg_len)
 {
 	struct request_queue *q;
 	struct ore_per_dev_state *per_dev;
@@ -366,17 +366,60 @@ static int _add_to_read_4_write(struct ore_io_state *ios,
 		_ore_add_sg_seg(per_dev, gap, true);
 	}
 	q = osd_request_queue(ore_comp_dev(read_ios->oc, per_dev->dev));
-	added_len = bio_add_pc_page(q, per_dev->bio, page, PAGE_SIZE, 0);
-	if (unlikely(added_len != PAGE_SIZE)) {
+	added_len = bio_add_pc_page(q, per_dev->bio, page, pg_len,
+				    si->obj_offset % PAGE_SIZE);
+	if (unlikely(added_len != pg_len)) {
 		ORE_DBGMSG("Failed to bio_add_pc_page bi_vcnt=%d\n",
 			   per_dev->bio->bi_vcnt);
 		return -ENOMEM;
 	}
 
-	per_dev->length += PAGE_SIZE;
+	per_dev->length += pg_len;
 	return 0;
 }
 
+/* read the beginning of an unaligned first page */
+static int _add_to_r4w_first_page(struct ore_io_state *ios, struct page *page)
+{
+	struct ore_striping_info si;
+	unsigned pg_len;
+
+	ore_calc_stripe_info(ios->layout, ios->offset, 0, &si);
+
+	pg_len = si.obj_offset % PAGE_SIZE;
+	si.obj_offset -= pg_len;
+
+	ORE_DBGMSG("offset=0x%llx len=0x%x index=0x%lx dev=%x\n",
+		   _LLU(si.obj_offset), pg_len, page->index, si.dev);
+
+	return _add_to_r4w(ios, &si, page, pg_len);
+}
+
+/* read the end of an incomplete last page */
+static int _add_to_r4w_last_page(struct ore_io_state *ios, u64 *offset)
+{
+	struct ore_striping_info si;
+	struct page *page;
+	unsigned pg_len, p, c;
+
+	ore_calc_stripe_info(ios->layout, *offset, 0, &si);
+
+	p = si.unit_off / PAGE_SIZE;
+	c = _dev_order(ios->layout->group_width * ios->layout->mirrors_p1,
+		       ios->layout->mirrors_p1, si.par_dev, si.dev);
+	page = ios->sp2d->_1p_stripes[p].pages[c];
+
+	pg_len = PAGE_SIZE - (si.unit_off % PAGE_SIZE);
+	*offset += pg_len;
+
+	ORE_DBGMSG("p=%d, c=%d next-offset=0x%llx len=0x%x dev=%x par_dev=%d\n",
+		   p, c, _LLU(*offset), pg_len, si.dev, si.par_dev);
+
+	BUG_ON(!page);
+
+	return _add_to_r4w(ios, &si, page, pg_len);
+}
+
 static void _mark_read4write_pages_uptodate(struct ore_io_state *ios, int ret)
 {
 	struct bio_vec *bv;
@@ -444,9 +487,13 @@ static int _read_4_write(struct ore_io_state *ios)
 			struct page **pp = &_1ps->pages[c];
 			bool uptodate;
 
-			if (*pp)
+			if (*pp) {
+				if (ios->offset % PAGE_SIZE)
+					/* Read the remainder of the page */
+					_add_to_r4w_first_page(ios, *pp);
 				/* to-be-written pages start here */
 				goto read_last_stripe;
+			}
 
 			*pp = ios->r4w->get_page(ios->private, offset,
 						 &uptodate);
@@ -454,7 +501,7 @@ static int _read_4_write(struct ore_io_state *ios)
 				return -ENOMEM;
 
 			if (!uptodate)
-				_add_to_read_4_write(ios, &read_si, *pp);
+				_add_to_r4w(ios, &read_si, *pp, PAGE_SIZE);
 
 			/* Mark read-pages to be cache_released */
 			_1ps->page_is_read[c] = true;
@@ -465,8 +512,11 @@ static int _read_4_write(struct ore_io_state *ios)
 	}
 
 read_last_stripe:
-	offset = ios->offset + (ios->length + PAGE_SIZE - 1) /
-				PAGE_SIZE * PAGE_SIZE;
+	offset = ios->offset + ios->length;
+	if (offset % PAGE_SIZE)
+		_add_to_r4w_last_page(ios, &offset);
+	/* offset will be aligned to next page */
+
 	last_stripe_end = div_u64(offset + bytes_in_stripe - 1, bytes_in_stripe)
 				 * bytes_in_stripe;
 	if (offset == last_stripe_end) /* Optimize for the aligned case */
@@ -503,7 +553,7 @@ read_last_stripe:
 			/* Mark read-pages to be cache_released */
 			_1ps->page_is_read[c] = true;
 			if (!uptodate)
-				_add_to_read_4_write(ios, &read_si, page);
+				_add_to_r4w(ios, &read_si, page, PAGE_SIZE);
 		}
 
 		offset += PAGE_SIZE;
@@ -616,8 +666,6 @@ int _ore_post_alloc_raid_stuff(struct ore_io_state *ios)
 			return -ENOMEM;
 		}
 
-		BUG_ON(ios->offset % PAGE_SIZE);
-
 		/* Round io down to last full strip */
 		first_stripe = div_u64(ios->offset, stripe_size);
 		last_stripe = div_u64(ios->offset + ios->length, stripe_size);
-- 
1.7.2.3
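The alignment arithmetic behind `_add_to_r4w_first_page()` and `_add_to_r4w_last_page()` can be illustrated with a small userspace sketch. It is a simplification under stated assumptions: it works on the file offset directly (the patch works on `si.obj_offset` and `si.unit_off`, which give the same in-page remainder only if the stripe unit is a multiple of the page size), and `PG_SIZE`, `head_len()`, `tail_len()` and `aligned_end()` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define PG_SIZE 4096u	/* assumed page size for this sketch */

/* Bytes that precede the IO inside its first page (0 if aligned).
 * This is the part that must be read in before the first page. */
static unsigned head_len(uint64_t offset)
{
	return (unsigned)(offset % PG_SIZE);
}

/* Bytes that follow the IO inside its last page (0 if aligned).
 * This is the part that must be read in after the last page. */
static unsigned tail_len(uint64_t offset, uint64_t length)
{
	unsigned in_page = (unsigned)((offset + length) % PG_SIZE);

	return in_page ? PG_SIZE - in_page : 0;
}

/* End of the IO rounded up to the next page boundary, which is
 * where the read of the rest of the last stripe continues. */
static uint64_t aligned_end(uint64_t offset, uint64_t length)
{
	uint64_t end = offset + length + tail_len(offset, length);

	assert(end % PG_SIZE == 0);
	return end;
}
```

For an IO at offset 4196 of length 1000, the head is 100 bytes, the tail is 2996 bytes, and reading continues from 8192. For a fully aligned IO both lengths are zero and no extra segment is needed.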
* Re: [PATCH 4/4] ore: Must support none-PAGE-aligned IO
From: Boaz Harrosh @ 2012-01-08 8:50 UTC
To: linux-fsdevel, open-osd, Stable Tree

On 01/06/2012 04:46 PM, Boaz Harrosh wrote:
>
> NFS might send us offsets that are not PAGE aligned. So
> we must read in the remainder of the first/last pages, in case
> we need it for parity calculations.
>
> We only add an sg segment to read the partial page. But
> we don't mark it as read=true because it is a lock-for-write
> page.
>
> TODO: In some cases (IO spans a single unit) we can just
> adjust the raid_unit offset/length, but this is left for
> later Kernels.
>
> [Bug in 3.2.0 Kernel]
> CC: Stable Tree <stable@kernel.org>
> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

This patch had a data corruption bug. I'll post a version 2.

Here is the diff of ver2 from ver1:
---
diff --git a/fs/exofs/ore_raid.c b/fs/exofs/ore_raid.c
index b3047ef..d222c77 100644
--- a/fs/exofs/ore_raid.c
+++ b/fs/exofs/ore_raid.c
@@ -366,7 +366,8 @@ static int _add_to_r4w(struct ore_io_state *ios, struct ore_striping_info *si,
 		_ore_add_sg_seg(per_dev, gap, true);
 	}
 	q = osd_request_queue(ore_comp_dev(read_ios->oc, per_dev->dev));
-	added_len = bio_add_pc_page(q, per_dev->bio, page, pg_len, 0);
+	added_len = bio_add_pc_page(q, per_dev->bio, page, pg_len,
+				    si->obj_offset % PAGE_SIZE);
 	if (unlikely(added_len != pg_len)) {
 		ORE_DBGMSG("Failed to bio_add_pc_page bi_vcnt=%d\n",
 			   per_dev->bio->bi_vcnt);
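Why the in-page offset matters can be shown with a toy model. The message only says that ver1 had a data corruption bug; the reading below, that a tail read placed at offset 0 lands on top of the bytes about to be written, is an interpretation of the interdiff and not something the author spells out. All names here (`PG_SIZE`, `read_into_page()`, `fill_last_page()`) are hypothetical, and `memset()` stands in for a device read.

```c
#include <assert.h>
#include <string.h>

#define PG_SIZE 16u	/* tiny page so the effect is easy to see */

/* Simulate a device read of len bytes that lands in page at page_off.
 * The 'R' bytes stand in for data read back from the device. */
static void read_into_page(char *page, unsigned page_off, unsigned len)
{
	assert(page_off + len <= PG_SIZE);
	memset(page + page_off, 'R', len);
}

/* Fill the tail of a last page whose first io_end bytes hold data that
 * is about to be written ('W'); the rest starts out as '.'.
 * use_offset != 0 models ver2 (the read lands at io_end),
 * use_offset == 0 models ver1 (the read lands at offset 0). */
static void fill_last_page(char *page, unsigned io_end, int use_offset)
{
	unsigned pg_len = PG_SIZE - io_end;

	memset(page, 'W', io_end);
	memset(page + io_end, '.', pg_len);
	read_into_page(page, use_offset ? io_end : 0, pg_len);
}
```

With `io_end` set to 10, the ver2 model leaves ten `W` bytes followed by six `R` bytes, while the ver1 model overwrites the first six `W` bytes and leaves the tail unfilled. For the first page the patch subtracts `pg_len` from `si.obj_offset` before the call, so the remainder is 0 there and the same expression serves both cases.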