* O_DIRECT change
@ 2011-06-03 16:48 Sage Weil
2011-06-04 2:29 ` Henry C Chang
0 siblings, 1 reply; 4+ messages in thread
From: Sage Weil @ 2011-06-03 16:48 UTC (permalink / raw)
To: henry.cy.chang; +Cc: ceph-devel
Hi Henry,
I made a small change to the O_DIRECT path to zero holes properly in
commit 85defe7 (below). Do you mind reviewing the change, and/or testing,
since you are the main O_DIRECT user? The test case that was failing is
here:
http://tracker.newdream.net/issues/1096#note-19
The problem was that the read coming down from the VFS isn't trimmed to
i_size, so the old zero tail check wasn't true, and we would set
*checkeof. Then ceph_aio_read would getattr and loop, since we
didn't actually read eof (due to a hole).
Actually, I suspect the *checkeof part is still incorrect... does the
zeroing part at least look right to you?
Thanks!
sage
From 85defe76f7e2a0b3d285a3be72fcffce96629b5c Mon Sep 17 00:00:00 2001
From: Sage Weil <sage@newdream.net>
Date: Wed, 1 Jun 2011 16:08:44 -0700
Subject: [PATCH] ceph: fix short sync reads from the OSD
If we get a short read from the OSD because the object is small, we need to
zero the remainder of the buffer. For O_DIRECT reads, the attempted range
is not trimmed to i_size by the VFS, so we were actually looping
indefinitely.
Fix by trimming by i_size, and the unconditionally zeroing the trailing
range.
Reported-by: Jeff Wu <cpwu@tnsoft.com.cn>
Signed-off-by: Sage Weil <sage@newdream.net>
---
fs/ceph/file.c | 28 +++++++++++++++-------------
1 files changed, 15 insertions(+), 13 deletions(-)
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 8c5ac4e..b654f40 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -283,7 +283,7 @@ int ceph_release(struct inode *inode, struct file *file)
static int striped_read(struct inode *inode,
u64 off, u64 len,
struct page **pages, int num_pages,
- int *checkeof, bool align_to_pages,
+ int *checkeof, bool o_direct,
unsigned long buf_align)
{
struct ceph_fs_client *fsc = ceph_inode_to_client(inode);
@@ -308,7 +308,7 @@ static int striped_read(struct inode *inode,
io_align = off & ~PAGE_MASK;
more:
- if (align_to_pages)
+ if (o_direct)
page_align = (pos - io_align + buf_align) & ~PAGE_MASK;
else
page_align = pos & ~PAGE_MASK;
@@ -346,20 +346,22 @@ more:
}
if (was_short) {
- /* was original extent fully inside i_size? */
- if (pos + left <= inode->i_size) {
- dout("zero tail\n");
- ceph_zero_page_vector_range(page_off + read, len - read,
+ /* did we bounce off eof? */
+ if (pos + left > inode->i_size)
+ *checkeof = 1;
+
+ /* zero trailing bytes (inside i_size) */
+ if (left > 0 && pos < inode->i_size) {
+ if (pos + left > inode->i_size)
+ left = inode->i_size - pos;
+
+ dout("zero tail %d\n", left);
+ ceph_zero_page_vector_range(page_off + read, left,
pages);
- read = len;
- goto out;
+ read += left;
}
-
- /* check i_size */
- *checkeof = 1;
}
-out:
if (ret >= 0)
ret = read;
dout("striped_read returns %d\n", ret);
@@ -659,7 +661,7 @@ out:
/* hit EOF or hole? */
if (statret == 0 && *ppos < inode->i_size) {
- dout("aio_read sync_read hit hole, reading more\n");
+ dout("aio_read sync_read hit hole, ppos %lld < size %lld, reading more\n", *ppos, inode->i_size);
read += ret;
base += ret;
len -= ret;
--
1.7.0
^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: O_DIRECT change
2011-06-03 16:48 O_DIRECT change Sage Weil
@ 2011-06-04 2:29 ` Henry C Chang
2011-06-07 9:39 ` Henry C Chang
0 siblings, 1 reply; 4+ messages in thread
From: Henry C Chang @ 2011-06-04 2:29 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel
Hi Sage,
We are on Dragon Boat Festival holidays. :) I will take a close look
and test it next week.
--
Henry
2011/6/4 Sage Weil <sage@newdream.net>:
> Hi Henry,
>
> I made a small change to the O_DIRECT path to zero holes properly in
> commit 85defe7 (below). Do you mind reviewing the change, and/or testing,
> since you are the main O_DIRECT user? The test case that was failing is
> here:
>
> http://tracker.newdream.net/issues/1096#note-19
>
> The problem was that the read coming down from the VFS isn't trimmed to
> i_size, so the old zero tail check wasn't true, and we would set
> *checkeof. Then ceph_aio_read would getattr and loop, since we
> didn't actually read eof (due to a hole).
>
> Actually, I suspect the *checkeof part is still incorrect... does the
> zeroing part at least look right to you?
>
> Thanks!
> sage
>
>
> From 85defe76f7e2a0b3d285a3be72fcffce96629b5c Mon Sep 17 00:00:00 2001
> From: Sage Weil <sage@newdream.net>
> Date: Wed, 1 Jun 2011 16:08:44 -0700
> Subject: [PATCH] ceph: fix short sync reads from the OSD
>
> If we get a short read from the OSD because the object is small, we need to
> zero the remainder of the buffer. For O_DIRECT reads, the attempted range
> is not trimmed to i_size by the VFS, so we were actually looping
> indefinitely.
>
> Fix by trimming by i_size, and the unconditionally zeroing the trailing
> range.
>
> Reported-by: Jeff Wu <cpwu@tnsoft.com.cn>
> Signed-off-by: Sage Weil <sage@newdream.net>
> ---
> fs/ceph/file.c | 28 +++++++++++++++-------------
> 1 files changed, 15 insertions(+), 13 deletions(-)
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 8c5ac4e..b654f40 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -283,7 +283,7 @@ int ceph_release(struct inode *inode, struct file *file)
> static int striped_read(struct inode *inode,
> u64 off, u64 len,
> struct page **pages, int num_pages,
> - int *checkeof, bool align_to_pages,
> + int *checkeof, bool o_direct,
> unsigned long buf_align)
> {
> struct ceph_fs_client *fsc = ceph_inode_to_client(inode);
> @@ -308,7 +308,7 @@ static int striped_read(struct inode *inode,
> io_align = off & ~PAGE_MASK;
>
> more:
> - if (align_to_pages)
> + if (o_direct)
> page_align = (pos - io_align + buf_align) & ~PAGE_MASK;
> else
> page_align = pos & ~PAGE_MASK;
> @@ -346,20 +346,22 @@ more:
> }
>
> if (was_short) {
> - /* was original extent fully inside i_size? */
> - if (pos + left <= inode->i_size) {
> - dout("zero tail\n");
> - ceph_zero_page_vector_range(page_off + read, len - read,
> + /* did we bounce off eof? */
> + if (pos + left > inode->i_size)
> + *checkeof = 1;
> +
> + /* zero trailing bytes (inside i_size) */
> + if (left > 0 && pos < inode->i_size) {
> + if (pos + left > inode->i_size)
> + left = inode->i_size - pos;
> +
> + dout("zero tail %d\n", left);
> + ceph_zero_page_vector_range(page_off + read, left,
> pages);
> - read = len;
> - goto out;
> + read += left;
> }
> -
> - /* check i_size */
> - *checkeof = 1;
> }
>
> -out:
> if (ret >= 0)
> ret = read;
> dout("striped_read returns %d\n", ret);
> @@ -659,7 +661,7 @@ out:
>
> /* hit EOF or hole? */
> if (statret == 0 && *ppos < inode->i_size) {
> - dout("aio_read sync_read hit hole, reading more\n");
> + dout("aio_read sync_read hit hole, ppos %lld < size %lld, reading more\n", *ppos, inode->i_size);
> read += ret;
> base += ret;
> len -= ret;
> --
> 1.7.0
>
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: O_DIRECT change
2011-06-04 2:29 ` Henry C Chang
@ 2011-06-07 9:39 ` Henry C Chang
2011-06-08 5:07 ` Sage Weil
0 siblings, 1 reply; 4+ messages in thread
From: Henry C Chang @ 2011-06-07 9:39 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel
Hi Sage,
I checked the stripe_read function and think the following two patches
are needed:
1. Move hit_stripe/was_short checking after the adjustment of
ceph_osdc_readpages return code
Fix the following case:
(i) Create a sparse file
dd if=/dev/zero of=/mnt/fs_depot/dd3 bs=1 seek=1048576 count=0
(ii) Read the file
dd if=/mnt/fs_depot/dd3 of=/root/ddout1 skip=8 bs=500 count=2
iflag=direct
diff --git a/ceph/file.c b/ceph/file.c
index 1f36e2c..6e6297a 100644
--- a/ceph/file.c
+++ b/ceph/file.c
@@ -313,16 +313,18 @@ more:
page_align = (pos - io_align + buf_align) & ~PAGE_MASK;
else
page_align = pos & ~PAGE_MASK;
+
this_len = left;
ret = ceph_osdc_readpages(&fsc->client->osdc, ceph_vino(inode),
&ci->i_layout, pos, &this_len,
ci->i_truncate_seq,
ci->i_truncate_size,
page_pos, pages_left, page_align);
- hit_stripe = this_len < left;
- was_short = ret >= 0 && ret < this_len;
if (ret == -ENOENT)
ret = 0;
+
+ hit_stripe = this_len < left;
+ was_short = ret >= 0 && ret < this_len;
dout("striped_read %llu~%u (read %u) got %d%s%s\n", pos, left, read,
ret, hit_stripe ? " HITSTRIPE" : "", was_short ? " SHORT" : "");
2. Fix didpages and the starting position of ceph_zero_page_vector_range
This fixes segfault caused by the following scenario:
(i) generate a sparse file by
dd if=/dev/urandom of=/mnt/fs_depot/dd10 bs=500 seek=8388 count=1
(ii) read the file from offset 4194300~500
dd if=/mnt/fs_depot/dd10 of=/root/dd10out bs=500 skip=8388 count=1
diff --git a/ceph/file.c b/ceph/file.c
index 6e6297a..d7932bc 100644
--- a/ceph/file.c
+++ b/ceph/file.c
@@ -291,7 +291,6 @@ static int striped_read(struct inode *inode,
struct ceph_inode_info *ci = ceph_inode(inode);
u64 pos, this_len;
int io_align, page_align;
- int page_off = off & ~PAGE_CACHE_MASK; /* first byte's offset in page */
int left, pages_left;
int read;
struct page **page_pos;
@@ -329,12 +328,11 @@ more:
ret, hit_stripe ? " HITSTRIPE" : "", was_short ? " SHORT" : "");
if (ret > 0) {
- int didpages =
- ((pos & ~PAGE_CACHE_MASK) + ret) >> PAGE_CACHE_SHIFT;
+ int didpages = (page_align + ret) >> PAGE_CACHE_SHIFT;
if (read < pos - off) {
dout(" zero gap %llu to %llu\n", off + read, pos);
- ceph_zero_page_vector_range(page_off + read,
+ ceph_zero_page_vector_range(page_align + read,
pos - off - read, pages);
}
pos += ret;
@@ -359,7 +357,7 @@ more:
left = inode->i_size - pos;
dout("zero tail %d\n", left);
- ceph_zero_page_vector_range(page_off + read, left,
+ ceph_zero_page_vector_range(page_align + read, left,
pages);
read += left;
}
^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: O_DIRECT change
2011-06-07 9:39 ` Henry C Chang
@ 2011-06-08 5:07 ` Sage Weil
0 siblings, 0 replies; 4+ messages in thread
From: Sage Weil @ 2011-06-08 5:07 UTC (permalink / raw)
To: Henry C Chang; +Cc: ceph-devel
On Tue, 7 Jun 2011, Henry C Chang wrote:
> Hi Sage,
>
> I checked the stripe_read function and think the following two patches
> are needed:
>
> 1. Move hit_stripe/was_short checking after the adjustment of
> ceph_osdc_readpages return code
>
> Fix the following case:
> (i) Create a sparse file
> dd if=/dev/zero of=/mnt/fs_depot/dd3 bs=1 seek=1048576 count=0
>
> (ii) Read the file
> dd if=/mnt/fs_depot/dd3 of=/root/ddout1 skip=8 bs=500 count=2
> iflag=direct
>
> diff --git a/ceph/file.c b/ceph/file.c
> index 1f36e2c..6e6297a 100644
> --- a/ceph/file.c
> +++ b/ceph/file.c
> @@ -313,16 +313,18 @@ more:
> page_align = (pos - io_align + buf_align) & ~PAGE_MASK;
> else
> page_align = pos & ~PAGE_MASK;
> +
> this_len = left;
> ret = ceph_osdc_readpages(&fsc->client->osdc, ceph_vino(inode),
> &ci->i_layout, pos, &this_len,
> ci->i_truncate_seq,
> ci->i_truncate_size,
> page_pos, pages_left, page_align);
> - hit_stripe = this_len < left;
> - was_short = ret >= 0 && ret < this_len;
> if (ret == -ENOENT)
> ret = 0;
> +
> + hit_stripe = this_len < left;
> + was_short = ret >= 0 && ret < this_len;
> dout("striped_read %llu~%u (read %u) got %d%s%s\n", pos, left, read,
> ret, hit_stripe ? " HITSTRIPE" : "", was_short ? " SHORT" : "");
Good catch. Applied.
> 2. Fix didpages and the starting position of ceph_zero_page_vector_range
>
> This fixes segfault caused by the following scenario:
> (i) generate a sparse file by
> dd if=/dev/urandom of=/mnt/fs_depot/dd10 bs=500 seek=8388 count=1
> (ii) read the file from offset 4194300~500
> dd if=/mnt/fs_depot/dd10 of=/root/dd10out bs=500 skip=8388 count=1
>
> diff --git a/ceph/file.c b/ceph/file.c
> index 6e6297a..d7932bc 100644
> --- a/ceph/file.c
> +++ b/ceph/file.c
> @@ -291,7 +291,6 @@ static int striped_read(struct inode *inode,
> struct ceph_inode_info *ci = ceph_inode(inode);
> u64 pos, this_len;
> int io_align, page_align;
> - int page_off = off & ~PAGE_CACHE_MASK; /* first byte's offset in page */
> int left, pages_left;
> int read;
> struct page **page_pos;
> @@ -329,12 +328,11 @@ more:
> ret, hit_stripe ? " HITSTRIPE" : "", was_short ? " SHORT" : "");
>
> if (ret > 0) {
> - int didpages =
> - ((pos & ~PAGE_CACHE_MASK) + ret) >> PAGE_CACHE_SHIFT;
> + int didpages = (page_align + ret) >> PAGE_CACHE_SHIFT;
>
> if (read < pos - off) {
> dout(" zero gap %llu to %llu\n", off + read, pos);
> - ceph_zero_page_vector_range(page_off + read,
> + ceph_zero_page_vector_range(page_align + read,
> pos - off - read, pages);
> }
> pos += ret;
> @@ -359,7 +357,7 @@ more:
> left = inode->i_size - pos;
>
> dout("zero tail %d\n", left);
> - ceph_zero_page_vector_range(page_off + read, left,
> + ceph_zero_page_vector_range(page_align + read, left,
> pages);
> read += left;
> }
Sigh, the alignment adjustments still make my head hurt... Did you test
this with O_DIRECT as well? IIRC, the last time I worked on this code the
problem was with O_DIRECT when the memory alignment didn't match the file
offset. Something like
- allocate buffer with page + 512 alignment
- read from offset 1024
I think what we really need is a series of tests that verify we get
correct data with all combinations of file alignment, buffer alignment,
O_DIRECT and sync (multiclient) io. Probably that means writing with
O_DIRECT, reading with buffered IO, and then writing with buffered IO and
reading with O_DIRECT. We should also test the same thing across object
boundaries. I wonder if xfstests has something?
Unfortunately it's somewhat annoying to force the client into normal sync
mode with a single client (you have to mount -o sync). I wonder if it's
worth adding an ioctl to set a flag on the fd to do this.
I can't find the test program I was using before, although I also remember
not being happy with it... :/
Anyway, opening up an item in the tracker for this, #1147.
sage
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2011-06-08 5:05 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-06-03 16:48 O_DIRECT change Sage Weil
2011-06-04 2:29 ` Henry C Chang
2011-06-07 9:39 ` Henry C Chang
2011-06-08 5:07 ` Sage Weil
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.