* [PATCH 0/3] ceph: misc fixes
@ 2013-02-22  2:08 Yan, Zheng
  2013-02-22  2:08 ` [PATCH 1/3] ceph: fix LSSNAP regression Yan, Zheng
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Yan, Zheng @ 2013-02-22  2:08 UTC (permalink / raw)
  To: ceph-devel, sage

These patches are also in:
  git://github.com/ukernel/linux.git wip-ceph


* [PATCH 1/3] ceph: fix LSSNAP regression
  2013-02-22  2:08 [PATCH 0/3] ceph: misc fixes Yan, Zheng
@ 2013-02-22  2:08 ` Yan, Zheng
  2013-02-22  2:08 ` [PATCH 2/3] ceph: queue cap release when trimming cap Yan, Zheng
  2013-02-22  2:08 ` [PATCH 3/3] ceph: fix vmtruncate deadlock Yan, Zheng
  2 siblings, 0 replies; 11+ messages in thread
From: Yan, Zheng @ 2013-02-22  2:08 UTC (permalink / raw)
  To: ceph-devel, sage; +Cc: Yan, Zheng

From: "Yan, Zheng" <zheng.z.yan@intel.com>

Commit 6e8575faa8 makes parse_reply_info_extra() return -EIO for LSSNAP replies.
Parse them with parse_reply_info_dir(), the same path already used for READDIR.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 fs/ceph/mds_client.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index d958420..608ffcf 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -265,7 +265,8 @@ static int parse_reply_info_extra(void **p, void *end,
 {
 	if (info->head->op == CEPH_MDS_OP_GETFILELOCK)
 		return parse_reply_info_filelock(p, end, info, features);
-	else if (info->head->op == CEPH_MDS_OP_READDIR)
+	else if (info->head->op == CEPH_MDS_OP_READDIR ||
+		 info->head->op == CEPH_MDS_OP_LSSNAP)
 		return parse_reply_info_dir(p, end, info, features);
 	else if (info->head->op == CEPH_MDS_OP_CREATE)
 		return parse_reply_info_create(p, end, info, features);
-- 
1.7.11.7



* [PATCH 2/3] ceph: queue cap release when trimming cap
  2013-02-22  2:08 [PATCH 0/3] ceph: misc fixes Yan, Zheng
  2013-02-22  2:08 ` [PATCH 1/3] ceph: fix LSSNAP regression Yan, Zheng
@ 2013-02-22  2:08 ` Yan, Zheng
  2013-02-22  2:08 ` [PATCH 3/3] ceph: fix vmtruncate deadlock Yan, Zheng
  2 siblings, 0 replies; 11+ messages in thread
From: Yan, Zheng @ 2013-02-22  2:08 UTC (permalink / raw)
  To: ceph-devel, sage; +Cc: Yan, Zheng

From: "Yan, Zheng" <zheng.z.yan@intel.com>

So that the client will later send a cap release message to the MDS.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 fs/ceph/caps.c       | 6 +++---
 fs/ceph/mds_client.c | 2 ++
 fs/ceph/super.h      | 2 ++
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 1e1e020..5d5c32b 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -997,9 +997,9 @@ static int send_cap_msg(struct ceph_mds_session *session,
 	return 0;
 }
 
-static void __queue_cap_release(struct ceph_mds_session *session,
-				u64 ino, u64 cap_id, u32 migrate_seq,
-				u32 issue_seq)
+void __queue_cap_release(struct ceph_mds_session *session,
+			 u64 ino, u64 cap_id, u32 migrate_seq,
+			 u32 issue_seq)
 {
 	struct ceph_msg *msg;
 	struct ceph_mds_cap_release *head;
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 608ffcf..ccc68b0 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1197,6 +1197,8 @@ static int trim_caps_cb(struct inode *inode, struct ceph_cap *cap, void *arg)
 	session->s_trim_caps--;
 	if (oissued) {
 		/* we aren't the only cap.. just remove us */
+		__queue_cap_release(session, ceph_ino(inode), cap->cap_id,
+				    cap->mseq, cap->issue_seq);
 		__ceph_remove_cap(cap);
 	} else {
 		/* try to drop referring dentries */
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 66ebe72..e1eb50a 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -755,6 +755,8 @@ static inline void ceph_remove_cap(struct ceph_cap *cap)
 extern void ceph_put_cap(struct ceph_mds_client *mdsc,
 			 struct ceph_cap *cap);
 
+extern void __queue_cap_release(struct ceph_mds_session *session, u64 ino,
+				u64 cap_id, u32 migrate_seq, u32 issue_seq);
 extern void ceph_queue_caps_release(struct inode *inode);
 extern int ceph_write_inode(struct inode *inode, struct writeback_control *wbc);
 extern int ceph_fsync(struct file *file, loff_t start, loff_t end,
-- 
1.7.11.7



* [PATCH 3/3] ceph: fix vmtruncate deadlock
  2013-02-22  2:08 [PATCH 0/3] ceph: misc fixes Yan, Zheng
  2013-02-22  2:08 ` [PATCH 1/3] ceph: fix LSSNAP regression Yan, Zheng
  2013-02-22  2:08 ` [PATCH 2/3] ceph: queue cap release when trimming cap Yan, Zheng
@ 2013-02-22  2:08 ` Yan, Zheng
  2013-02-22 18:54   ` Gregory Farnum
  2 siblings, 1 reply; 11+ messages in thread
From: Yan, Zheng @ 2013-02-22  2:08 UTC (permalink / raw)
  To: ceph-devel, sage; +Cc: Yan, Zheng

From: "Yan, Zheng" <zheng.z.yan@intel.com>

If there is a pending truncation, ceph_get_caps() waits for a work thread
to apply it. To apply the pending truncation, the work thread needs to
acquire i_mutex. But in the buffered write case, ceph_get_caps() is called
while i_mutex is already held, so the writer and the work thread end up
waiting for each other.

The fix is to make ceph_get_caps() not wait, and to apply the pending
truncation in ceph_aio_write() instead.
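
For illustration, a minimal userspace model of the cycle (pthread-based; the
names and helpers are invented for this sketch and are not the kernel API).
It compiles with -lpthread and, by design, hangs the same way the writer and
the work thread hang:

#include <pthread.h>
#include <stdio.h>

/* Toy model: "i_mutex" stands for the inode mutex, truncate_pending for the
 * queued vmtruncate that a work thread must apply. */
static pthread_mutex_t i_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t truncated = PTHREAD_COND_INITIALIZER;
static int truncate_pending = 1;

/* Work thread: must take i_mutex before it may apply the truncation. */
static void *truncate_work(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&i_mutex);		/* blocks: the writer holds it */
	pthread_mutex_lock(&state_lock);
	truncate_pending = 0;
	pthread_cond_signal(&truncated);
	pthread_mutex_unlock(&state_lock);
	pthread_mutex_unlock(&i_mutex);
	return NULL;
}

int main(void)
{
	pthread_t worker;

	pthread_mutex_lock(&i_mutex);		/* buffered write path holds i_mutex */
	pthread_create(&worker, NULL, truncate_work, NULL);

	/* ceph_get_caps()-style wait for the pending truncation to be applied */
	pthread_mutex_lock(&state_lock);
	while (truncate_pending)
		pthread_cond_wait(&truncated, &state_lock);	/* never woken */
	pthread_mutex_unlock(&state_lock);

	pthread_mutex_unlock(&i_mutex);
	pthread_join(worker, NULL);
	printf("write finished\n");	/* unreachable while the cycle exists */
	return 0;
}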

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 fs/ceph/caps.c |  6 ------
 fs/ceph/file.c | 15 ++++++++++-----
 2 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 5d5c32b..b9d8417 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -2067,12 +2067,6 @@ static int try_get_cap_refs(struct ceph_inode_info *ci, int need, int want,
 	}
 	have = __ceph_caps_issued(ci, &implemented);
 
-	/*
-	 * disallow writes while a truncate is pending
-	 */
-	if (ci->i_truncate_pending)
-		have &= ~CEPH_CAP_FILE_WR;
-
 	if ((have & need) == need) {
 		/*
 		 * Look at (implemented & ~have & not) so that we keep waiting
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index a1e5b81..bf7849a 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -653,7 +653,6 @@ static ssize_t ceph_aio_read(struct kiocb *iocb, const struct iovec *iov,
 	dout("aio_read %p %llx.%llx %llu~%u trying to get caps on %p\n",
 	     inode, ceph_vinop(inode), pos, (unsigned)len, inode);
 again:
-	__ceph_do_pending_vmtruncate(inode);
 	if (fi->fmode & CEPH_FILE_MODE_LAZY)
 		want = CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_LAZYIO;
 	else
@@ -724,11 +723,13 @@ static ssize_t ceph_aio_write(struct kiocb *iocb, const struct iovec *iov,
 	if (ceph_snap(inode) != CEPH_NOSNAP)
 		return -EROFS;
 
+	sb_start_write(inode->i_sb);
 retry_snap:
 	written = 0;
-	if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL))
-		return -ENOSPC;
-	__ceph_do_pending_vmtruncate(inode);
+	if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL)) {
+		ret = -ENOSPC;
+		goto out;
+	}
 
 	/*
 	 * try to do a buffered write.  if we don't have sufficient
@@ -738,7 +739,10 @@ retry_snap:
 	if (!(iocb->ki_filp->f_flags & O_DIRECT) &&
 	    !(inode->i_sb->s_flags & MS_SYNCHRONOUS) &&
 	    !(fi->flags & CEPH_F_SYNC)) {
-		ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
+		mutex_lock(&inode->i_mutex);
+		__ceph_do_pending_vmtruncate(inode);
+		ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
+		mutex_unlock(&inode->i_mutex);
 		if (ret >= 0)
 			written = ret;
 
@@ -783,6 +787,7 @@ out:
 		     inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len);
 		goto retry_snap;
 	}
+	sb_end_write(inode->i_sb);
 
 	return ret;
 }
-- 
1.7.11.7



* Re: [PATCH 3/3] ceph: fix vmtruncate deadlock
  2013-02-22  2:08 ` [PATCH 3/3] ceph: fix vmtruncate deadlock Yan, Zheng
@ 2013-02-22 18:54   ` Gregory Farnum
  2013-02-23  4:31     ` Yan, Zheng
  0 siblings, 1 reply; 11+ messages in thread
From: Gregory Farnum @ 2013-02-22 18:54 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: ceph-devel, Sage Weil

I haven't spent that much time in the kernel client, but this patch
isn't working out for me. In particular, I'm pretty sure we need to
preserve this:

> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index 5d5c32b..b9d8417 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -2067,12 +2067,6 @@ static int try_get_cap_refs(struct ceph_inode_info *ci, int need, int want,
>         }
>         have = __ceph_caps_issued(ci, &implemented);
>
> -       /*
> -        * disallow writes while a truncate is pending
> -        */
> -       if (ci->i_truncate_pending)
> -               have &= ~CEPH_CAP_FILE_WR;
> -
>         if ((have & need) == need) {
>                 /*
>                  * Look at (implemented & ~have & not) so that we keep waiting

Because if there's a pending truncate, we really can't write. You do
handle it in the particular case of doing buffered file writes, but
these caps are a marker of permission, and the client shouldn't have
write permission to a file until it's up to date on the truncates. Or
am I misunderstanding something?


> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index a1e5b81..bf7849a 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -653,7 +653,6 @@ static ssize_t ceph_aio_read(struct kiocb *iocb, const struct iovec *iov,
>         dout("aio_read %p %llx.%llx %llu~%u trying to get caps on %p\n",
>              inode, ceph_vinop(inode), pos, (unsigned)len, inode);
>  again:
> -       __ceph_do_pending_vmtruncate(inode);

There doesn't seem to be any kind of replacement for this? We need to
do any pending truncates before reading or we might get stale data
back.

The first two in the series look good to me, but I'll wait and pull
them in as a unit with this one once we're done discussing. :)
-Greg


* Re: [PATCH 3/3] ceph: fix vmtruncate deadlock
  2013-02-22 18:54   ` Gregory Farnum
@ 2013-02-23  4:31     ` Yan, Zheng
  2013-02-26  0:01       ` Gregory Farnum
  0 siblings, 1 reply; 11+ messages in thread
From: Yan, Zheng @ 2013-02-23  4:31 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel, Sage Weil

On 02/23/2013 02:54 AM, Gregory Farnum wrote:
> I haven't spent that much time in the kernel client, but this patch
> isn't working out for me. In particular, I'm pretty sure we need to
> preserve this:
> 
>> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>> index 5d5c32b..b9d8417 100644
>> --- a/fs/ceph/caps.c
>> +++ b/fs/ceph/caps.c
>> @@ -2067,12 +2067,6 @@ static int try_get_cap_refs(struct ceph_inode_info *ci, int need, int want,
>>         }
>>         have = __ceph_caps_issued(ci, &implemented);
>>
>> -       /*
>> -        * disallow writes while a truncate is pending
>> -        */
>> -       if (ci->i_truncate_pending)
>> -               have &= ~CEPH_CAP_FILE_WR;
>> -
>>         if ((have & need) == need) {
>>                 /*
>>                  * Look at (implemented & ~have & not) so that we keep waiting
> 
> Because if there's a pending truncate, we really can't write. You do
> handle it in the particular case of doing buffered file writes, but
> these caps are a marker of permission, and the client shouldn't have
> write permission to a file until it's up to date on the truncates. Or
> am I misunderstanding something?

A pending vmtruncate is only relevant to the buffered write case. If the client doesn't have
the 'b' cap, the page cache is empty and __ceph_do_pending_vmtruncate() is a no-op. For buffered
writes, this patch only affects the situation where the client receives a truncate request from
the MDS in the middle of a write. I think it's better to do the vmtruncate after the write finishes.

> 
> 
>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>> index a1e5b81..bf7849a 100644
>> --- a/fs/ceph/file.c
>> +++ b/fs/ceph/file.c
>> @@ -653,7 +653,6 @@ static ssize_t ceph_aio_read(struct kiocb *iocb, const struct iovec *iov,
>>         dout("aio_read %p %llx.%llx %llu~%u trying to get caps on %p\n",
>>              inode, ceph_vinop(inode), pos, (unsigned)len, inode);
>>  again:
>> -       __ceph_do_pending_vmtruncate(inode);
> 
> There doesn't seem to be any kind of replacement for this? We need to
> do any pending truncates before reading or we might get stale data
> back.

generic_file_aio_read() checks i_size when copying data to the user buffer, so the user
program can't get stale data. This __ceph_do_pending_vmtruncate() call is not protected by
i_mutex, which is a potential bug; that's the reason I removed it.
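
As a toy illustration of that point (the helper name is invented, it is not
the generic read path itself): the copy to userspace is bounded by i_size, so
bytes beyond the new EOF are never returned even if stale pages are still
sitting in the page cache.

#include <stddef.h>

/* Upper bound on what a read at 'pos' for 'len' bytes may copy back to
 * userspace when the file is 'i_size' bytes long: nothing past EOF. */
size_t bytes_to_copy(size_t pos, size_t len, size_t i_size)
{
	if (pos >= i_size)
		return 0;
	return len < i_size - pos ? len : i_size - pos;
}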

Regards
Yan, Zheng

> 
> The first two in the series look good to me, but I'll wait and pull
> them in as a unit with this one once we're done discussing. :)
> -Greg
> 



* Re: [PATCH 3/3] ceph: fix vmtruncate deadlock
  2013-02-23  4:31     ` Yan, Zheng
@ 2013-02-26  0:01       ` Gregory Farnum
  2013-02-26  2:15         ` Yan, Zheng
  2013-02-26 18:19         ` Gregory Farnum
  0 siblings, 2 replies; 11+ messages in thread
From: Gregory Farnum @ 2013-02-26  0:01 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: ceph-devel, Sage Weil

On Fri, Feb 22, 2013 at 8:31 PM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
> On 02/23/2013 02:54 AM, Gregory Farnum wrote:
>> I haven't spent that much time in the kernel client, but this patch
>> isn't working out for me. In particular, I'm pretty sure we need to
>> preserve this:
>>
>>> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>>> index 5d5c32b..b9d8417 100644
>>> --- a/fs/ceph/caps.c
>>> +++ b/fs/ceph/caps.c
>>> @@ -2067,12 +2067,6 @@ static int try_get_cap_refs(struct ceph_inode_info *ci, int need, int want,
>>>         }
>>>         have = __ceph_caps_issued(ci, &implemented);
>>>
>>> -       /*
>>> -        * disallow writes while a truncate is pending
>>> -        */
>>> -       if (ci->i_truncate_pending)
>>> -               have &= ~CEPH_CAP_FILE_WR;
>>> -
>>>         if ((have & need) == need) {
>>>                 /*
>>>                  * Look at (implemented & ~have & not) so that we keep waiting
>>
>> Because if there's a pending truncate, we really can't write. You do
>> handle it in the particular case of doing buffered file writes, but
>> these caps are a marker of permission, and the client shouldn't have
>> write permission to a file until it's up to date on the truncates. Or
>> am I misunderstanding something?
>
> pending vmtruncate is only relevant to buffered write case. If client doesn't have 'b' cap,
> the page cache is empty, and __ceph_do_pending_vmtruncate is no-op. For buffered write,
> this patch only affects situation that clients receives truncate request from MDS in the
> middle of write.
It looks to me like truncates can get queued for later, so that's not the case?
And how could the client receive a truncate while in the middle of
writing? Either it's got the write caps (in which case nobody else can
truncate), or it shouldn't be writing...I suppose if somebody enables
lazyIO that would do it, but again in that case I imagine we'd rather
do the truncate before writing:

> I think it's better to do vmtruncate after write finishes.
What if you're writing past the truncate point? Then your written data
would be truncated off even though it should have been a file
extension.
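
To make the ordering concrete (numbers invented): say the queued truncation is
to 1 KB and the client then writes 2 KB at offset 4 MB. Applying the truncation
first and the write second leaves a 4 MB + 2 KB file; deferring the truncation
until after the write leaves a 1 KB file and silently drops the extension.

#include <stdio.h>

/* File size after an extending write of 'len' bytes at offset 'off'. */
static long after_write(long size, long off, long len)
{
	return off + len > size ? off + len : size;
}

int main(void)
{
	long trunc_to = 1024, off = 4L * 1024 * 1024, len = 2048;

	/* truncate first, then write: the write re-extends the file */
	printf("truncate then write: %ld bytes\n", after_write(trunc_to, off, len));
	/* write first, deferred truncate after: the extension is lost */
	printf("write then truncate: %ld bytes\n", trunc_to);
	return 0;
}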


>>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>>> index a1e5b81..bf7849a 100644
>>> --- a/fs/ceph/file.c
>>> +++ b/fs/ceph/file.c
>>> @@ -653,7 +653,6 @@ static ssize_t ceph_aio_read(struct kiocb *iocb, const struct iovec *iov,
>>>         dout("aio_read %p %llx.%llx %llu~%u trying to get caps on %p\n",
>>>              inode, ceph_vinop(inode), pos, (unsigned)len, inode);
>>>  again:
>>> -       __ceph_do_pending_vmtruncate(inode);
>>
>> There doesn't seem to be any kind of replacement for this? We need to
>> do any pending truncates before reading or we might get stale data
>> back.
>
> generic_file_aio_read checks i_size when coping data to user buffer, so the user program can't
> get stale data. This __ceph_do_pending_vmtruncate is not protected by i_mutex, it's a potential
> bug, that's the reason I remove it.

Oh, you're right about the locking (I think, anyway?). However, i_size
doesn't protect us — we might for instance have truncated to zero and
then back up to the original file size. :( But that doesn't look like
it's handled correctly anyway (racy) — am I misreading that badly or
is the infrastructure in fact broken in that case?

As I said, I don't spend much time in the kernel client, but I looked
around a bit more and this is what I came up with. Education, if
nothing else. ;)
-Greg

* Re: [PATCH 3/3] ceph: fix vmtruncate deadlock
  2013-02-26  0:01       ` Gregory Farnum
@ 2013-02-26  2:15         ` Yan, Zheng
  2013-02-26  5:00           ` Sage Weil
  2013-02-26 18:19         ` Gregory Farnum
  1 sibling, 1 reply; 11+ messages in thread
From: Yan, Zheng @ 2013-02-26  2:15 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel, Sage Weil

On 02/26/2013 08:01 AM, Gregory Farnum wrote:
> On Fri, Feb 22, 2013 at 8:31 PM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
>> On 02/23/2013 02:54 AM, Gregory Farnum wrote:
>>> I haven't spent that much time in the kernel client, but this patch
>>> isn't working out for me. In particular, I'm pretty sure we need to
>>> preserve this:
>>>
>>>> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>>>> index 5d5c32b..b9d8417 100644
>>>> --- a/fs/ceph/caps.c
>>>> +++ b/fs/ceph/caps.c
>>>> @@ -2067,12 +2067,6 @@ static int try_get_cap_refs(struct ceph_inode_info *ci, int need, int want,
>>>>         }
>>>>         have = __ceph_caps_issued(ci, &implemented);
>>>>
>>>> -       /*
>>>> -        * disallow writes while a truncate is pending
>>>> -        */
>>>> -       if (ci->i_truncate_pending)
>>>> -               have &= ~CEPH_CAP_FILE_WR;
>>>> -
>>>>         if ((have & need) == need) {
>>>>                 /*
>>>>                  * Look at (implemented & ~have & not) so that we keep waiting
>>>
>>> Because if there's a pending truncate, we really can't write. You do
>>> handle it in the particular case of doing buffered file writes, but
>>> these caps are a marker of permission, and the client shouldn't have
>>> write permission to a file until it's up to date on the truncates. Or
>>> am I misunderstanding something?
>>
>> pending vmtruncate is only relevant to buffered write case. If client doesn't have 'b' cap,
>> the page cache is empty, and __ceph_do_pending_vmtruncate is no-op. For buffered write,
>> this patch only affects situation that clients receives truncate request from MDS in the
>> middle of write.
> It looks to me like truncates can get queued for later, so that's not the case?
> And how could the client receive a truncate while in the middle of
> writing? Either it's got the write caps (in which case nobody else can
> truncate), or it shouldn't be writing...I suppose if somebody enables
> lazyIO that would do it, but again in that case I imagine we'd rather
> do the truncate before writing:

Commit 22cddde104 makes buffered writes get/put caps for each individual page, so the
MDS can revoke the write caps in the middle of a write.

> 
>> I think it's better to do vmtruncate after write finishes.
> What if you're writing past the truncate point? Then your written data
> would be truncated off even though it should have been a file
> extension.
>
I was wrong; I thought the order of write and truncate was not important. Now I'm worried
about the correctness of commit 22cddde104; we probably should revert that commit.
 
> 
>>>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>>>> index a1e5b81..bf7849a 100644
>>>> --- a/fs/ceph/file.c
>>>> +++ b/fs/ceph/file.c
>>>> @@ -653,7 +653,6 @@ static ssize_t ceph_aio_read(struct kiocb *iocb, const struct iovec *iov,
>>>>         dout("aio_read %p %llx.%llx %llu~%u trying to get caps on %p\n",
>>>>              inode, ceph_vinop(inode), pos, (unsigned)len, inode);
>>>>  again:
>>>> -       __ceph_do_pending_vmtruncate(inode);
>>>
>>> There doesn't seem to be any kind of replacement for this? We need to
>>> do any pending truncates before reading or we might get stale data
>>> back.
>>
>> generic_file_aio_read checks i_size when coping data to user buffer, so the user program can't
>> get stale data. This __ceph_do_pending_vmtruncate is not protected by i_mutex, it's a potential
>> bug, that's the reason I remove it.
> 
> Oh, you're right about the locking (I think, anyway?). However, i_size
> doesn't protect us — we might for instance have truncated to zero and
> then back up to the original file size. :( But that doesn't look like
> it's handled correctly anyway (racy) — am I misreading that badly or
> is the infrastructure in fact broken in that case?

You are right. Probably we can fix the race by disallowing both reads and writes while
there is a pending truncate.

Thank you 
Yan, Zheng

> 
> As I said, I don't spend much time in the kernel client, but I looked
> around a bit more and this is what I came up with. Education, if
> nothing else. ;)
> -Greg
> 


* Re: [PATCH 3/3] ceph: fix vmtruncate deadlock
  2013-02-26  2:15         ` Yan, Zheng
@ 2013-02-26  5:00           ` Sage Weil
  2013-02-26 11:05             ` Yan, Zheng
  0 siblings, 1 reply; 11+ messages in thread
From: Sage Weil @ 2013-02-26  5:00 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: Gregory Farnum, ceph-devel

On Tue, 26 Feb 2013, Yan, Zheng wrote:
> > It looks to me like truncates can get queued for later, so that's not the case?
> > And how could the client receive a truncate while in the middle of
> > writing? Either it's got the write caps (in which case nobody else can
> > truncate), or it shouldn't be writing...I suppose if somebody enables
> > lazyIO that would do it, but again in that case I imagine we'd rather
> > do the truncate before writing:
> 
> Commit 22cddde104 make buffered write get/put caps for individual page. So the MDS can
> revoke the write caps in the middle of write.
> 
> > 
> >> I think it's better to do vmtruncate after write finishes.
> > What if you're writing past the truncate point? Then your written data
> > would be truncated off even though it should have been a file
> > extension.
> >
> I was wrong, I thought the order of write and truncate is not important. 
> Now I'm worry about the correctness of commit 22cddde104, we probably 
> should revert that commit.

I'm sorry I haven't had time to dig into this thread.  Do you think we 
should revert that commit for this merge window?

My gut feeling is that the whole locking scheme here needs to be reworked.  
I suspect where we'll end up is something more like XFS, where there is an 
explicit truncate (or other) mutex that is separate from i_mutex.  The 
fact that we were relying on i_mutex serialization from the VFS was 
fragile and (as we can see) ultimately a flawed approach.  And the locking 
and races between writes, mmap, and truncation in the page cache code are 
fragile and annoying.  The only good thing going for us is that fsx is 
pretty good at stress testing the code.  :)
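
As a very rough userspace sketch of that direction (every name here is
hypothetical, none of it is the current ceph code): a dedicated truncate
mutex lets the worker apply a queued truncation without ever taking i_mutex,
and the write path drains the queue under a fixed i_mutex -> truncate_mutex
ordering.

#include <pthread.h>

/* Hypothetical per-inode locking: i_mutex for the VFS write path plus a
 * separate truncate_mutex that serializes applying queued truncations. */
struct inode_model {
	pthread_mutex_t i_mutex;
	pthread_mutex_t truncate_mutex;
	int truncate_pending;
};

/* Work thread: only needs truncate_mutex, so it cannot deadlock against a
 * writer that already holds i_mutex. */
void apply_pending_truncate(struct inode_model *ino)
{
	pthread_mutex_lock(&ino->truncate_mutex);
	if (ino->truncate_pending) {
		/* ... invalidate pages, update the size ... */
		ino->truncate_pending = 0;
	}
	pthread_mutex_unlock(&ino->truncate_mutex);
}

/* Buffered write: i_mutex first, then flush any queued truncation before
 * dirtying pages; the lock order is always i_mutex -> truncate_mutex. */
void buffered_write(struct inode_model *ino)
{
	pthread_mutex_lock(&ino->i_mutex);
	apply_pending_truncate(ino);
	/* ... copy data into the page cache ... */
	pthread_mutex_unlock(&ino->i_mutex);
}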

> > Oh, you're right about the locking (I think, anyway?). However, i_size 
> > doesn't protect us -- we might for instance have truncated to zero and 
> > then back up to the original file size. :( But that doesn't look like 
> > it's handled correctly anyway (racy) -- am I misreading that badly or 
> > is the infrastructure in fact broken in that case?
> 
> You are right. probably we can fix the race by disallowing both read and 
> write when there is pending truncate.

FWIW, I think the high-level strategy from before should still be basically 
sound: we can queue a truncation without blocking, and any write path will 
apply a pending truncation under the appropriate lock.  But I think it's 
going to mean carefully documenting the locking requirements for the 
various paths and working out a new locking structure.

Yan, is this something you are able to put some time into?  I would like 
to work on it, but it's going to be hard for me to find the time in the 
next several weeks to get to it.

sage


* Re: [PATCH 3/3] ceph: fix vmtruncate deadlock
  2013-02-26  5:00           ` Sage Weil
@ 2013-02-26 11:05             ` Yan, Zheng
  0 siblings, 0 replies; 11+ messages in thread
From: Yan, Zheng @ 2013-02-26 11:05 UTC (permalink / raw)
  To: Sage Weil; +Cc: Gregory Farnum, ceph-devel

On 02/26/2013 01:00 PM, Sage Weil wrote:
> On Tue, 26 Feb 2013, Yan, Zheng wrote:
>>> It looks to me like truncates can get queued for later, so that's not the case?
>>> And how could the client receive a truncate while in the middle of
>>> writing? Either it's got the write caps (in which case nobody else can
>>> truncate), or it shouldn't be writing...I suppose if somebody enables
>>> lazyIO that would do it, but again in that case I imagine we'd rather
>>> do the truncate before writing:
>>
>> Commit 22cddde104 make buffered write get/put caps for individual page. So the MDS can
>> revoke the write caps in the middle of write.
>>
>>>
>>>> I think it's better to do vmtruncate after write finishes.
>>> What if you're writing past the truncate point? Then your written data
>>> would be truncated off even though it should have been a file
>>> extension.
>>>
>> I was wrong, I thought the order of write and truncate is not important. 
>> Now I'm worry about the correctness of commit 22cddde104, we probably 
>> should revert that commit.
> 
> I'm sorry I haven't had time to dig into this thread.  Do you think we 
> should revert that commit for this merge window?

No, that commit fixes an i_size update bug, which is more important than the deadlock.
But I think the eventual fix is to revert that commit and also remove the optimization
that drops Fw caps early (to avoid slow cap revocation caused by balance_dirty_pages).

> 
> My gut feeling is that the whole locking scheme here needs to be reworked.  
> I suspect where we'll end up is something more like XFS, where there is an 
> explicit truncate (or other) mutex that is separate from i_mutex.  The 
> fact that we were relying on i_mutex serialization from the VFS was 
> fragile and (as we can see) ultimately a flawed approach.  And the locking 
> and races between writes, mmap, and truncation in the page cache code are 
> fragile and annoying.  The only good thing going for us is that fsx is 
> pretty good at stress testing the code.  :)

If we want to keep the write operation atomic, write and vmtruncate need to be
protected by a mutex. I don't think introducing a new mutex makes things simpler.

> 
>>> Oh, you're right about the locking (I think, anyway?). However, i_size 
>>> doesn't protect us -- we might for instance have truncated to zero and 
>>> then back up to the original file size. :( But that doesn't look like 
>>> it's handled correctly anyway (racy) -- am I misreading that badly or 
>>> is the infrastructure in fact broken in that case?
>>
>> You are right. probably we can fix the race by disallowing both read and 
>> write when there is pending truncate.
> 
> FWIW, I think the high-level strategy before should still be basically 
> sound: we can queue a truncation without blocking, and any write path will 
> apply a pending truncation under the appropriate lock.  But I think it's 
> going to mean carefully documenting the locking requirements for the 
> various paths and working out a new locking structure.
> 
> Yan, is this something you are able to put some time into?  I would like 
> to work on it, but it's going to be hard for me to find the time in the 
> next several weeks to get to it.

OK, I will work on it.

Regards
Yan, Zheng


* Re: [PATCH 3/3] ceph: fix vmtruncate deadlock
  2013-02-26  0:01       ` Gregory Farnum
  2013-02-26  2:15         ` Yan, Zheng
@ 2013-02-26 18:19         ` Gregory Farnum
  1 sibling, 0 replies; 11+ messages in thread
From: Gregory Farnum @ 2013-02-26 18:19 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: ceph-devel, Sage Weil

On Mon, Feb 25, 2013 at 4:01 PM, Gregory Farnum <greg@inktank.com> wrote:
> On Fri, Feb 22, 2013 at 8:31 PM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
>> On 02/23/2013 02:54 AM, Gregory Farnum wrote:
>>> I haven't spent that much time in the kernel client, but this patch
>>> isn't working out for me. In particular, I'm pretty sure we need to
>>> preserve this:
>>>
>>>> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>>>> index 5d5c32b..b9d8417 100644
>>>> --- a/fs/ceph/caps.c
>>>> +++ b/fs/ceph/caps.c
>>>> @@ -2067,12 +2067,6 @@ static int try_get_cap_refs(struct ceph_inode_info *ci, int need, int want,
>>>>         }
>>>>         have = __ceph_caps_issued(ci, &implemented);
>>>>
>>>> -       /*
>>>> -        * disallow writes while a truncate is pending
>>>> -        */
>>>> -       if (ci->i_truncate_pending)
>>>> -               have &= ~CEPH_CAP_FILE_WR;
>>>> -
>>>>         if ((have & need) == need) {
>>>>                 /*
>>>>                  * Look at (implemented & ~have & not) so that we keep waiting
>>>
>>> Because if there's a pending truncate, we really can't write. You do
>>> handle it in the particular case of doing buffered file writes, but
>>> these caps are a marker of permission, and the client shouldn't have
>>> write permission to a file until it's up to date on the truncates. Or
>>> am I misunderstanding something?
>>
>> pending vmtruncate is only relevant to buffered write case. If client doesn't have 'b' cap,
>> the page cache is empty, and __ceph_do_pending_vmtruncate is no-op. For buffered write,
>> this patch only affects situation that clients receives truncate request from MDS in the
>> middle of write.
> It looks to me like truncates can get queued for later, so that's not the case?
> And how could the client receive a truncate while in the middle of
> writing? Either it's got the write caps (in which case nobody else can
> truncate), or it shouldn't be writing...I suppose if somebody enables
> lazyIO that would do it, but again in that case I imagine we'd rather
> do the truncate before writing:
>
>> I think it's better to do vmtruncate after write finishes.
> What if you're writing past the truncate point? Then your written data
> would be truncated off even though it should have been a file
> extension.
>
>
>>>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>>>> index a1e5b81..bf7849a 100644
>>>> --- a/fs/ceph/file.c
>>>> +++ b/fs/ceph/file.c
>>>> @@ -653,7 +653,6 @@ static ssize_t ceph_aio_read(struct kiocb *iocb, const struct iovec *iov,
>>>>         dout("aio_read %p %llx.%llx %llu~%u trying to get caps on %p\n",
>>>>              inode, ceph_vinop(inode), pos, (unsigned)len, inode);
>>>>  again:
>>>> -       __ceph_do_pending_vmtruncate(inode);
>>>
>>> There doesn't seem to be any kind of replacement for this? We need to
>>> do any pending truncates before reading or we might get stale data
>>> back.
>>
>> generic_file_aio_read checks i_size when coping data to user buffer, so the user program can't
>> get stale data. This __ceph_do_pending_vmtruncate is not protected by i_mutex, it's a potential
>> bug, that's the reason I remove it.
>
> Oh, you're right about the locking (I think, anyway?). However, i_size
> doesn't protect us — we might for instance have truncated to zero and
> then back up to the original file size. :( But that doesn't look like
> it's handled correctly anyway (racy) — am I misreading that badly or
> is the infrastructure in fact broken in that case?

Sage pointed out in standup today that we only send a truncate message
for truncate downs, and he thinks that the more complicated cases
(truncate to zero, write out 2KB, truncate to 1KB) are okay thanks to
the capabilities bits. I haven't thought it through but that seems
plausible to me.
-Greg
