All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH 2/6] ceph: hold extra reference to r_parent over life of request
       [not found] ` <20200106153520.307523-3-jlayton@kernel.org>
@ 2020-01-09  2:05   ` Xiubo Li
  2020-01-09 11:20     ` Jeff Layton
  0 siblings, 1 reply; 12+ messages in thread
From: Xiubo Li @ 2020-01-09  2:05 UTC (permalink / raw)
  To: Jeff Layton, ceph-devel; +Cc: sage, idryomov, zyan, pdonnell

On 2020/1/6 23:35, Jeff Layton wrote:
> Currently, we just assume that it will stick around by virtue of the
> submitter's reference, but later patches will allow the syscall to
> return early and we can't rely on that reference at that point.
>
> Take an extra reference to the inode when setting r_parent and release
> it when releasing the request.
>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>   fs/ceph/mds_client.c | 8 ++++++--
>   1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 94cce2ab92c4..b7122f682678 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -708,8 +708,10 @@ void ceph_mdsc_release_request(struct kref *kref)
>   		/* avoid calling iput_final() in mds dispatch threads */
>   		ceph_async_iput(req->r_inode);
>   	}
> -	if (req->r_parent)
> +	if (req->r_parent) {
>   		ceph_put_cap_refs(ceph_inode(req->r_parent), CEPH_CAP_PIN);
> +		ceph_async_iput(req->r_parent);
> +	}
>   	ceph_async_iput(req->r_target_inode);
>   	if (req->r_dentry)
>   		dput(req->r_dentry);
> @@ -2706,8 +2708,10 @@ int ceph_mdsc_submit_request(struct ceph_mds_client *mdsc, struct inode *dir,
>   	/* take CAP_PIN refs for r_inode, r_parent, r_old_dentry */
>   	if (req->r_inode)
>   		ceph_get_cap_refs(ceph_inode(req->r_inode), CEPH_CAP_PIN);
> -	if (req->r_parent)
> +	if (req->r_parent) {
>   		ceph_get_cap_refs(ceph_inode(req->r_parent), CEPH_CAP_PIN);
> +		ihold(req->r_parent);
> +	}

This might also fix another issue when the mdsc request is timedout and 
returns to the vfs, then the r_parent maybe released in vfs. And then if 
we reference it again in mdsc handle_reply() --> 
ceph_mdsc_release_request(),  some unknown issues may happen later ??

>   	if (req->r_old_dentry_dir)
>   		ceph_get_cap_refs(ceph_inode(req->r_old_dentry_dir),
>   				  CEPH_CAP_PIN);

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 6/6] ceph: perform asynchronous unlink if we have sufficient caps
       [not found] ` <20200106153520.307523-7-jlayton@kernel.org>
@ 2020-01-09  9:18   ` Yan, Zheng
  2020-01-09 11:34     ` Jeff Layton
  2020-01-09 14:49   ` Yan, Zheng
  1 sibling, 1 reply; 12+ messages in thread
From: Yan, Zheng @ 2020-01-09  9:18 UTC (permalink / raw)
  To: Jeff Layton, ceph-devel; +Cc: sage, idryomov, pdonnell

On 1/6/20 11:35 PM, Jeff Layton wrote:
> From: "Yan, Zheng" <zyan@redhat.com>
> 
> The MDS is getting a new lock-caching facility that will allow it
> to cache the necessary locks to allow asynchronous directory operations.
> Since the CEPH_CAP_FILE_* caps are currently unused on directories,
> we can repurpose those bits for this purpose.
> 
> When performing an unlink, if we have Fx on the parent directory,
> and CEPH_CAP_DIR_UNLINK (aka Fr), and we know that the dentry being
> removed is the primary link, then then we can fire off an unlink
> request immediately and don't need to wait on reply before returning.
> 
> In that situation, just fix up the dcache and link count and return
> immediately after issuing the call to the MDS. This does mean that we
> need to hold an extra reference to the inode being unlinked, and extra
> references to the caps to avoid races. Those references are put and
> error handling is done in the r_callback routine.
> 
> If the operation ends up failing, then set a writeback error on the
> directory inode that can be fetched later by an fsync on the dir.
> 
> Since this feature is very new, also add a new module parameter to
> enable and disable it (default is disabled).
> 
> Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>   fs/ceph/caps.c               | 35 ++++++++++++------
>   fs/ceph/dir.c                | 70 +++++++++++++++++++++++++++++++++---
>   fs/ceph/inode.c              |  8 ++++-
>   fs/ceph/super.c              |  4 +++
>   fs/ceph/super.h              |  3 ++
>   include/linux/ceph/ceph_fs.h |  9 +++++
>   6 files changed, 113 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index d05717397c2a..7fc87b693ba4 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -992,7 +992,11 @@ int __ceph_caps_file_wanted(struct ceph_inode_info *ci)
>   int __ceph_caps_wanted(struct ceph_inode_info *ci)
>   {
>   	int w = __ceph_caps_file_wanted(ci) | __ceph_caps_used(ci);
> -	if (!S_ISDIR(ci->vfs_inode.i_mode)) {
> +	if (S_ISDIR(ci->vfs_inode.i_mode)) {
> +		/* we want EXCL if holding caps of dir ops */
> +		if (w & CEPH_CAP_ANY_DIR_OPS)
> +			w |= CEPH_CAP_FILE_EXCL;
> +	} else {
>   		/* we want EXCL if dirty data */
>   		if (w & CEPH_CAP_FILE_BUFFER)
>   			w |= CEPH_CAP_FILE_EXCL;
> @@ -1883,10 +1887,13 @@ void ceph_check_caps(struct ceph_inode_info *ci, int flags,
>   			 * revoking the shared cap on every create/unlink
>   			 * operation.
>   			 */
> -			if (IS_RDONLY(inode))
> +			if (IS_RDONLY(inode)) {
>   				want = CEPH_CAP_ANY_SHARED;
> -			else
> -				want = CEPH_CAP_ANY_SHARED | CEPH_CAP_FILE_EXCL;
> +			} else {
> +				want = CEPH_CAP_ANY_SHARED |
> +				       CEPH_CAP_FILE_EXCL |
> +				       CEPH_CAP_ANY_DIR_OPS;
> +			}
>   			retain |= want;
>   		} else {
>   
> @@ -2649,7 +2656,10 @@ static int try_get_cap_refs(struct inode *inode, int need, int want,
>   				}
>   				snap_rwsem_locked = true;
>   			}
> -			*got = need | (have & want);
> +			if ((have & want) == want)
> +				*got = need | want;
> +			else
> +				*got = need;
>   			if (S_ISREG(inode->i_mode) &&
>   			    (need & CEPH_CAP_FILE_RD) &&
>   			    !(*got & CEPH_CAP_FILE_CACHE))
> @@ -2739,13 +2749,16 @@ int ceph_try_get_caps(struct inode *inode, int need, int want,
>   	int ret;
>   
>   	BUG_ON(need & ~CEPH_CAP_FILE_RD);
> -	BUG_ON(want & ~(CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO|CEPH_CAP_FILE_SHARED));
> -	ret = ceph_pool_perm_check(inode, need);
> -	if (ret < 0)
> -		return ret;
> +	if (need) {
> +		ret = ceph_pool_perm_check(inode, need);
> +		if (ret < 0)
> +			return ret;
> +	}
>   
> -	ret = try_get_cap_refs(inode, need, want, 0,
> -			       (nonblock ? NON_BLOCKING : 0), got);
> +	BUG_ON(want & ~(CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_LAZYIO |
> +			CEPH_CAP_FILE_SHARED | CEPH_CAP_FILE_EXCL |
> +			CEPH_CAP_ANY_DIR_OPS));
> +	ret = try_get_cap_refs(inode, need, want, 0, nonblock, got);
>   	return ret == -EAGAIN ? 0 : ret;
>   }
>   
> diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
> index d0cd0aba5843..10294f07f5f0 100644
> --- a/fs/ceph/dir.c
> +++ b/fs/ceph/dir.c
> @@ -1036,6 +1036,47 @@ static int ceph_link(struct dentry *old_dentry, struct inode *dir,
>   	return err;
>   }
>   
> +static void ceph_async_unlink_cb(struct ceph_mds_client *mdsc,
> +				 struct ceph_mds_request *req)
> +{
> +	/* If op failed, set error on parent directory */
> +	mapping_set_error(req->r_parent->i_mapping, req->r_err);
> +	if (req->r_err)
> +		printk("%s: req->r_err = %d\n", __func__, req->r_err);
> +	ceph_put_cap_refs(ceph_inode(req->r_parent),
> +			  CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_UNLINK);
> +	iput(req->r_old_inode);
> +}
> +
> +static bool get_caps_for_async_unlink(struct inode *dir, struct dentry *dentry)
> +{
> +	struct ceph_inode_info *ci = ceph_inode(dir);
> +	struct ceph_dentry_info *di;
> +	int ret, want, got;
> +
> +	want = CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_UNLINK;
> +	ret = ceph_try_get_caps(dir, 0, want, true, &got);
> +	dout("Fx on %p ret=%d got=%d\n", dir, ret, got);
> +	if (ret != 1 || got != want)
> +		return false;
> +
> +        spin_lock(&dentry->d_lock);
> +        di = ceph_dentry(dentry);
> +	/* - We are holding CEPH_CAP_FILE_EXCL, which implies
> +	 * CEPH_CAP_FILE_SHARED.
> +	 * - Only support async unlink for primary linkage */
> +	if (atomic_read(&ci->i_shared_gen) != di->lease_shared_gen ||
> +	    !(di->flags & CEPH_DENTRY_PRIMARY_LINK))
> +		ret = 0;
> +        spin_unlock(&dentry->d_lock);
> +
> +	if (!ret) {
> +		ceph_put_cap_refs(ci, got);
> +		return false;
> +	}
> +	return true;
> +}
> +
>   /*
>    * rmdir and unlink are differ only by the metadata op code
>    */
> @@ -1067,13 +1108,33 @@ static int ceph_unlink(struct inode *dir, struct dentry *dentry)
>   	req->r_dentry = dget(dentry);
>   	req->r_num_caps = 2;
>   	req->r_parent = dir;
> -	set_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
>   	req->r_dentry_drop = CEPH_CAP_FILE_SHARED;
>   	req->r_dentry_unless = CEPH_CAP_FILE_EXCL;
>   	req->r_inode_drop = ceph_drop_caps_for_unlink(inode);
> -	err = ceph_mdsc_do_request(mdsc, dir, req);
> -	if (!err && !req->r_reply_info.head->is_dentry)
> -		d_delete(dentry);
> +
> +	if (enable_async_dirops && op == CEPH_MDS_OP_UNLINK &&
> +	    get_caps_for_async_unlink(dir, dentry)) {
> +		dout("ceph: Async unlink on %lu/%.*s", dir->i_ino,
> +		     dentry->d_name.len, dentry->d_name.name);
> +		req->r_callback = ceph_async_unlink_cb;
> +		req->r_old_inode = d_inode(dentry);
> +		ihold(req->r_old_inode);
> +		err = ceph_mdsc_submit_request(mdsc, dir, req);
> +		if (!err) {
> +			/*
> +			 * We have enough caps, so we assume that the unlink
> +			 * will succeed. Fix up the target inode and dcache.
> +			 */
> +			drop_nlink(inode);
> +			d_delete(dentry);
> +		}
> +	} else {
> +		set_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
> +		err = ceph_mdsc_do_request(mdsc, dir, req);
> +		if (!err && !req->r_reply_info.head->is_dentry)
> +			d_delete(dentry);
> +	}
> +
>   	ceph_mdsc_put_request(req);
>   out:
>   	return err;
> @@ -1411,6 +1472,7 @@ void ceph_invalidate_dentry_lease(struct dentry *dentry)
>   	spin_lock(&dentry->d_lock);
>   	di->time = jiffies;
>   	di->lease_shared_gen = 0;
> +	di->flags &= ~CEPH_DENTRY_PRIMARY_LINK;
>   	__dentry_lease_unlist(di);
>   	spin_unlock(&dentry->d_lock);
>   }
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index ef9e8281d485..ffef475af72b 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -1048,6 +1048,7 @@ static void __update_dentry_lease(struct inode *dir, struct dentry *dentry,
>   				  struct ceph_mds_session **old_lease_session)
>   {
>   	struct ceph_dentry_info *di = ceph_dentry(dentry);
> +	unsigned mask = le16_to_cpu(lease->mask);
>   	long unsigned duration = le32_to_cpu(lease->duration_ms);
>   	long unsigned ttl = from_time + (duration * HZ) / 1000;
>   	long unsigned half_ttl = from_time + (duration * HZ / 2) / 1000;
> @@ -1059,8 +1060,13 @@ static void __update_dentry_lease(struct inode *dir, struct dentry *dentry,
>   	if (ceph_snap(dir) != CEPH_NOSNAP)
>   		return;
>   
> +	if (mask & CEPH_LEASE_PRIMARY_LINK)
> +		di->flags |= CEPH_DENTRY_PRIMARY_LINK;
> +	else
> +		di->flags &= ~CEPH_DENTRY_PRIMARY_LINK;
> +
>   	di->lease_shared_gen = atomic_read(&ceph_inode(dir)->i_shared_gen);
> -	if (duration == 0) {
> +	if (!(mask & CEPH_LEASE_VALID)) {
>   		__ceph_dentry_dir_lease_touch(di);
>   		return;
>   	}
> diff --git a/fs/ceph/super.c b/fs/ceph/super.c
> index 112927dbd2f2..c149edb6aa2d 100644
> --- a/fs/ceph/super.c
> +++ b/fs/ceph/super.c
> @@ -1324,6 +1324,10 @@ static void __exit exit_ceph(void)
>   	destroy_caches();
>   }
>   
> +bool enable_async_dirops;
> +module_param(enable_async_dirops, bool, 0644);
> +MODULE_PARM_DESC(enable_async_dirops, "Asynchronous directory operations enabled");
> +

why not use mount option

>   module_init(init_ceph);
>   module_exit(exit_ceph);
>   
> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> index 4b53f32b9324..4083de451710 100644
> --- a/fs/ceph/super.h
> +++ b/fs/ceph/super.h
> @@ -72,6 +72,8 @@
>   #define CEPH_CAPS_WANTED_DELAY_MIN_DEFAULT      5  /* cap release delay */
>   #define CEPH_CAPS_WANTED_DELAY_MAX_DEFAULT     60  /* cap release delay */
>   
> +extern bool enable_async_dirops;
> +
>   struct ceph_mount_options {
>   	unsigned int flags;
>   
> @@ -282,6 +284,7 @@ struct ceph_dentry_info {
>   #define CEPH_DENTRY_REFERENCED		1
>   #define CEPH_DENTRY_LEASE_LIST		2
>   #define CEPH_DENTRY_SHRINK_LIST		4
> +#define CEPH_DENTRY_PRIMARY_LINK	8
>   
>   struct ceph_inode_xattrs_info {
>   	/*
> diff --git a/include/linux/ceph/ceph_fs.h b/include/linux/ceph/ceph_fs.h
> index cb21c5cf12c3..a099f60feb7b 100644
> --- a/include/linux/ceph/ceph_fs.h
> +++ b/include/linux/ceph/ceph_fs.h
> @@ -530,6 +530,9 @@ struct ceph_mds_reply_lease {
>   	__le32 seq;
>   } __attribute__ ((packed));
>   
> +#define CEPH_LEASE_VALID        (1 | 2) /* old and new bit values */
> +#define CEPH_LEASE_PRIMARY_LINK 4       /* primary linkage */
> +
>   struct ceph_mds_reply_dirfrag {
>   	__le32 frag;            /* fragment */
>   	__le32 auth;            /* auth mds, if this is a delegation point */
> @@ -659,6 +662,12 @@ int ceph_flags_to_mode(int flags);
>   #define CEPH_CAP_LOCKS (CEPH_LOCK_IFILE | CEPH_LOCK_IAUTH | CEPH_LOCK_ILINK | \
>   			CEPH_LOCK_IXATTR)
>   
> +/* cap masks async dir operations */
> +#define CEPH_CAP_DIR_CREATE	CEPH_CAP_FILE_CACHE
> +#define CEPH_CAP_DIR_UNLINK	CEPH_CAP_FILE_RD
> +#define CEPH_CAP_ANY_DIR_OPS	(CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_RD | \
> +				 CEPH_CAP_FILE_WREXTEND | CEPH_CAP_FILE_LAZYIO)
> +
>   int ceph_caps_for_mode(int mode);
>   
>   enum {
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/6] ceph: hold extra reference to r_parent over life of request
  2020-01-09  2:05   ` [PATCH 2/6] ceph: hold extra reference to r_parent over life of request Xiubo Li
@ 2020-01-09 11:20     ` Jeff Layton
  2020-01-09 13:16       ` Xiubo Li
  0 siblings, 1 reply; 12+ messages in thread
From: Jeff Layton @ 2020-01-09 11:20 UTC (permalink / raw)
  To: Xiubo Li, ceph-devel; +Cc: sage, idryomov, zyan, pdonnell

On Thu, 2020-01-09 at 10:05 +0800, Xiubo Li wrote:
> On 2020/1/6 23:35, Jeff Layton wrote:
> > Currently, we just assume that it will stick around by virtue of the
> > submitter's reference, but later patches will allow the syscall to
> > return early and we can't rely on that reference at that point.
> > 
> > Take an extra reference to the inode when setting r_parent and release
> > it when releasing the request.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >   fs/ceph/mds_client.c | 8 ++++++--
> >   1 file changed, 6 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > index 94cce2ab92c4..b7122f682678 100644
> > --- a/fs/ceph/mds_client.c
> > +++ b/fs/ceph/mds_client.c
> > @@ -708,8 +708,10 @@ void ceph_mdsc_release_request(struct kref *kref)
> >   		/* avoid calling iput_final() in mds dispatch threads */
> >   		ceph_async_iput(req->r_inode);
> >   	}
> > -	if (req->r_parent)
> > +	if (req->r_parent) {
> >   		ceph_put_cap_refs(ceph_inode(req->r_parent), CEPH_CAP_PIN);
> > +		ceph_async_iput(req->r_parent);
> > +	}
> >   	ceph_async_iput(req->r_target_inode);
> >   	if (req->r_dentry)
> >   		dput(req->r_dentry);
> > @@ -2706,8 +2708,10 @@ int ceph_mdsc_submit_request(struct ceph_mds_client *mdsc, struct inode *dir,
> >   	/* take CAP_PIN refs for r_inode, r_parent, r_old_dentry */
> >   	if (req->r_inode)
> >   		ceph_get_cap_refs(ceph_inode(req->r_inode), CEPH_CAP_PIN);
> > -	if (req->r_parent)
> > +	if (req->r_parent) {
> >   		ceph_get_cap_refs(ceph_inode(req->r_parent), CEPH_CAP_PIN);
> > +		ihold(req->r_parent);
> > +	}
> 
> This might also fix another issue when the mdsc request is timedout and 
> returns to the vfs, then the r_parent maybe released in vfs. And then if 
> we reference it again in mdsc handle_reply() --> 
> ceph_mdsc_release_request(),  some unknown issues may happen later ??
> 

AIUI, when a timeout occurs, the req is unhashed such that handle_reply
can't find it. So, I doubt this affects that one way or another.

> >   	if (req->r_old_dentry_dir)
> >   		ceph_get_cap_refs(ceph_inode(req->r_old_dentry_dir),
> >   				  CEPH_CAP_PIN);
> 
> 

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 6/6] ceph: perform asynchronous unlink if we have sufficient caps
  2020-01-09  9:18   ` [PATCH 6/6] ceph: perform asynchronous unlink if we have sufficient caps Yan, Zheng
@ 2020-01-09 11:34     ` Jeff Layton
  2020-01-09 17:53       ` Patrick Donnelly
  0 siblings, 1 reply; 12+ messages in thread
From: Jeff Layton @ 2020-01-09 11:34 UTC (permalink / raw)
  To: Yan, Zheng, ceph-devel; +Cc: sage, idryomov, pdonnell

On Thu, 2020-01-09 at 17:18 +0800, Yan, Zheng wrote:
> On 1/6/20 11:35 PM, Jeff Layton wrote:
> > From: "Yan, Zheng" <zyan@redhat.com>
> > 
> > The MDS is getting a new lock-caching facility that will allow it
> > to cache the necessary locks to allow asynchronous directory operations.
> > Since the CEPH_CAP_FILE_* caps are currently unused on directories,
> > we can repurpose those bits for this purpose.
> > 
> > When performing an unlink, if we have Fx on the parent directory,
> > and CEPH_CAP_DIR_UNLINK (aka Fr), and we know that the dentry being
> > removed is the primary link, then then we can fire off an unlink
> > request immediately and don't need to wait on reply before returning.
> > 
> > In that situation, just fix up the dcache and link count and return
> > immediately after issuing the call to the MDS. This does mean that we
> > need to hold an extra reference to the inode being unlinked, and extra
> > references to the caps to avoid races. Those references are put and
> > error handling is done in the r_callback routine.
> > 
> > If the operation ends up failing, then set a writeback error on the
> > directory inode that can be fetched later by an fsync on the dir.
> > 
> > Since this feature is very new, also add a new module parameter to
> > enable and disable it (default is disabled).
> > 
> > Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >   fs/ceph/caps.c               | 35 ++++++++++++------
> >   fs/ceph/dir.c                | 70 +++++++++++++++++++++++++++++++++---
> >   fs/ceph/inode.c              |  8 ++++-
> >   fs/ceph/super.c              |  4 +++
> >   fs/ceph/super.h              |  3 ++
> >   include/linux/ceph/ceph_fs.h |  9 +++++
> >   6 files changed, 113 insertions(+), 16 deletions(-)
> > 
> > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > index d05717397c2a..7fc87b693ba4 100644
> > --- a/fs/ceph/caps.c
> > +++ b/fs/ceph/caps.c
> > @@ -992,7 +992,11 @@ int __ceph_caps_file_wanted(struct ceph_inode_info *ci)
> >   int __ceph_caps_wanted(struct ceph_inode_info *ci)
> >   {
> >   	int w = __ceph_caps_file_wanted(ci) | __ceph_caps_used(ci);
> > -	if (!S_ISDIR(ci->vfs_inode.i_mode)) {
> > +	if (S_ISDIR(ci->vfs_inode.i_mode)) {
> > +		/* we want EXCL if holding caps of dir ops */
> > +		if (w & CEPH_CAP_ANY_DIR_OPS)
> > +			w |= CEPH_CAP_FILE_EXCL;
> > +	} else {
> >   		/* we want EXCL if dirty data */
> >   		if (w & CEPH_CAP_FILE_BUFFER)
> >   			w |= CEPH_CAP_FILE_EXCL;
> > @@ -1883,10 +1887,13 @@ void ceph_check_caps(struct ceph_inode_info *ci, int flags,
> >   			 * revoking the shared cap on every create/unlink
> >   			 * operation.
> >   			 */
> > -			if (IS_RDONLY(inode))
> > +			if (IS_RDONLY(inode)) {
> >   				want = CEPH_CAP_ANY_SHARED;
> > -			else
> > -				want = CEPH_CAP_ANY_SHARED | CEPH_CAP_FILE_EXCL;
> > +			} else {
> > +				want = CEPH_CAP_ANY_SHARED |
> > +				       CEPH_CAP_FILE_EXCL |
> > +				       CEPH_CAP_ANY_DIR_OPS;
> > +			}
> >   			retain |= want;
> >   		} else {
> >   
> > @@ -2649,7 +2656,10 @@ static int try_get_cap_refs(struct inode *inode, int need, int want,
> >   				}
> >   				snap_rwsem_locked = true;
> >   			}
> > -			*got = need | (have & want);
> > +			if ((have & want) == want)
> > +				*got = need | want;
> > +			else
> > +				*got = need;
> >   			if (S_ISREG(inode->i_mode) &&
> >   			    (need & CEPH_CAP_FILE_RD) &&
> >   			    !(*got & CEPH_CAP_FILE_CACHE))
> > @@ -2739,13 +2749,16 @@ int ceph_try_get_caps(struct inode *inode, int need, int want,
> >   	int ret;
> >   
> >   	BUG_ON(need & ~CEPH_CAP_FILE_RD);
> > -	BUG_ON(want & ~(CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO|CEPH_CAP_FILE_SHARED));
> > -	ret = ceph_pool_perm_check(inode, need);
> > -	if (ret < 0)
> > -		return ret;
> > +	if (need) {
> > +		ret = ceph_pool_perm_check(inode, need);
> > +		if (ret < 0)
> > +			return ret;
> > +	}
> >   
> > -	ret = try_get_cap_refs(inode, need, want, 0,
> > -			       (nonblock ? NON_BLOCKING : 0), got);
> > +	BUG_ON(want & ~(CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_LAZYIO |
> > +			CEPH_CAP_FILE_SHARED | CEPH_CAP_FILE_EXCL |
> > +			CEPH_CAP_ANY_DIR_OPS));
> > +	ret = try_get_cap_refs(inode, need, want, 0, nonblock, got);
> >   	return ret == -EAGAIN ? 0 : ret;
> >   }
> >   
> > diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
> > index d0cd0aba5843..10294f07f5f0 100644
> > --- a/fs/ceph/dir.c
> > +++ b/fs/ceph/dir.c
> > @@ -1036,6 +1036,47 @@ static int ceph_link(struct dentry *old_dentry, struct inode *dir,
> >   	return err;
> >   }
> >   
> > +static void ceph_async_unlink_cb(struct ceph_mds_client *mdsc,
> > +				 struct ceph_mds_request *req)
> > +{
> > +	/* If op failed, set error on parent directory */
> > +	mapping_set_error(req->r_parent->i_mapping, req->r_err);
> > +	if (req->r_err)
> > +		printk("%s: req->r_err = %d\n", __func__, req->r_err);
> > +	ceph_put_cap_refs(ceph_inode(req->r_parent),
> > +			  CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_UNLINK);
> > +	iput(req->r_old_inode);
> > +}
> > +
> > +static bool get_caps_for_async_unlink(struct inode *dir, struct dentry *dentry)
> > +{
> > +	struct ceph_inode_info *ci = ceph_inode(dir);
> > +	struct ceph_dentry_info *di;
> > +	int ret, want, got;
> > +
> > +	want = CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_UNLINK;
> > +	ret = ceph_try_get_caps(dir, 0, want, true, &got);
> > +	dout("Fx on %p ret=%d got=%d\n", dir, ret, got);
> > +	if (ret != 1 || got != want)
> > +		return false;
> > +
> > +        spin_lock(&dentry->d_lock);
> > +        di = ceph_dentry(dentry);
> > +	/* - We are holding CEPH_CAP_FILE_EXCL, which implies
> > +	 * CEPH_CAP_FILE_SHARED.
> > +	 * - Only support async unlink for primary linkage */
> > +	if (atomic_read(&ci->i_shared_gen) != di->lease_shared_gen ||
> > +	    !(di->flags & CEPH_DENTRY_PRIMARY_LINK))
> > +		ret = 0;
> > +        spin_unlock(&dentry->d_lock);
> > +
> > +	if (!ret) {
> > +		ceph_put_cap_refs(ci, got);
> > +		return false;
> > +	}
> > +	return true;
> > +}
> > +
> >   /*
> >    * rmdir and unlink are differ only by the metadata op code
> >    */
> > @@ -1067,13 +1108,33 @@ static int ceph_unlink(struct inode *dir, struct dentry *dentry)
> >   	req->r_dentry = dget(dentry);
> >   	req->r_num_caps = 2;
> >   	req->r_parent = dir;
> > -	set_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
> >   	req->r_dentry_drop = CEPH_CAP_FILE_SHARED;
> >   	req->r_dentry_unless = CEPH_CAP_FILE_EXCL;
> >   	req->r_inode_drop = ceph_drop_caps_for_unlink(inode);
> > -	err = ceph_mdsc_do_request(mdsc, dir, req);
> > -	if (!err && !req->r_reply_info.head->is_dentry)
> > -		d_delete(dentry);
> > +
> > +	if (enable_async_dirops && op == CEPH_MDS_OP_UNLINK &&
> > +	    get_caps_for_async_unlink(dir, dentry)) {
> > +		dout("ceph: Async unlink on %lu/%.*s", dir->i_ino,
> > +		     dentry->d_name.len, dentry->d_name.name);
> > +		req->r_callback = ceph_async_unlink_cb;
> > +		req->r_old_inode = d_inode(dentry);
> > +		ihold(req->r_old_inode);
> > +		err = ceph_mdsc_submit_request(mdsc, dir, req);
> > +		if (!err) {
> > +			/*
> > +			 * We have enough caps, so we assume that the unlink
> > +			 * will succeed. Fix up the target inode and dcache.
> > +			 */
> > +			drop_nlink(inode);
> > +			d_delete(dentry);
> > +		}
> > +	} else {
> > +		set_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
> > +		err = ceph_mdsc_do_request(mdsc, dir, req);
> > +		if (!err && !req->r_reply_info.head->is_dentry)
> > +			d_delete(dentry);
> > +	}
> > +
> >   	ceph_mdsc_put_request(req);
> >   out:
> >   	return err;
> > @@ -1411,6 +1472,7 @@ void ceph_invalidate_dentry_lease(struct dentry *dentry)
> >   	spin_lock(&dentry->d_lock);
> >   	di->time = jiffies;
> >   	di->lease_shared_gen = 0;
> > +	di->flags &= ~CEPH_DENTRY_PRIMARY_LINK;
> >   	__dentry_lease_unlist(di);
> >   	spin_unlock(&dentry->d_lock);
> >   }
> > diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> > index ef9e8281d485..ffef475af72b 100644
> > --- a/fs/ceph/inode.c
> > +++ b/fs/ceph/inode.c
> > @@ -1048,6 +1048,7 @@ static void __update_dentry_lease(struct inode *dir, struct dentry *dentry,
> >   				  struct ceph_mds_session **old_lease_session)
> >   {
> >   	struct ceph_dentry_info *di = ceph_dentry(dentry);
> > +	unsigned mask = le16_to_cpu(lease->mask);
> >   	long unsigned duration = le32_to_cpu(lease->duration_ms);
> >   	long unsigned ttl = from_time + (duration * HZ) / 1000;
> >   	long unsigned half_ttl = from_time + (duration * HZ / 2) / 1000;
> > @@ -1059,8 +1060,13 @@ static void __update_dentry_lease(struct inode *dir, struct dentry *dentry,
> >   	if (ceph_snap(dir) != CEPH_NOSNAP)
> >   		return;
> >   
> > +	if (mask & CEPH_LEASE_PRIMARY_LINK)
> > +		di->flags |= CEPH_DENTRY_PRIMARY_LINK;
> > +	else
> > +		di->flags &= ~CEPH_DENTRY_PRIMARY_LINK;
> > +
> >   	di->lease_shared_gen = atomic_read(&ceph_inode(dir)->i_shared_gen);
> > -	if (duration == 0) {
> > +	if (!(mask & CEPH_LEASE_VALID)) {
> >   		__ceph_dentry_dir_lease_touch(di);
> >   		return;
> >   	}
> > diff --git a/fs/ceph/super.c b/fs/ceph/super.c
> > index 112927dbd2f2..c149edb6aa2d 100644
> > --- a/fs/ceph/super.c
> > +++ b/fs/ceph/super.c
> > @@ -1324,6 +1324,10 @@ static void __exit exit_ceph(void)
> >   	destroy_caches();
> >   }
> >   
> > +bool enable_async_dirops;
> > +module_param(enable_async_dirops, bool, 0644);
> > +MODULE_PARM_DESC(enable_async_dirops, "Asynchronous directory operations enabled");
> > +
> 
> why not use mount option
>

I'm open to suggestions here.

I mostly put this in originally to help facilitate performance testing.
A module option makes it easy to change this at runtime (without having
to remount or anything).

That said, we probably _do_ want to have a way for users to enable or
disable this feature. We'll probably want this disabled by default
initially, but I can forsee that changing once get more confidence.

Mount options are a bit of a pain to deal with over time. If the
defaults change, we need to document that in the manpages and online
documentation. If you put a mount option in the fstab, then you have to
deal with breakage if you boot to an earlier kernel that doesn't support
that option.

My thinking is that we should just use a module option initially (for
the really early adopters) and only convert that to a mount option as
the need for it becomes clear.


> >   module_init(init_ceph);
> >   module_exit(exit_ceph);
> >   
> > diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> > index 4b53f32b9324..4083de451710 100644
> > --- a/fs/ceph/super.h
> > +++ b/fs/ceph/super.h
> > @@ -72,6 +72,8 @@
> >   #define CEPH_CAPS_WANTED_DELAY_MIN_DEFAULT      5  /* cap release delay */
> >   #define CEPH_CAPS_WANTED_DELAY_MAX_DEFAULT     60  /* cap release delay */
> >   
> > +extern bool enable_async_dirops;
> > +
> >   struct ceph_mount_options {
> >   	unsigned int flags;
> >   
> > @@ -282,6 +284,7 @@ struct ceph_dentry_info {
> >   #define CEPH_DENTRY_REFERENCED		1
> >   #define CEPH_DENTRY_LEASE_LIST		2
> >   #define CEPH_DENTRY_SHRINK_LIST		4
> > +#define CEPH_DENTRY_PRIMARY_LINK	8
> >   
> >   struct ceph_inode_xattrs_info {
> >   	/*
> > diff --git a/include/linux/ceph/ceph_fs.h b/include/linux/ceph/ceph_fs.h
> > index cb21c5cf12c3..a099f60feb7b 100644
> > --- a/include/linux/ceph/ceph_fs.h
> > +++ b/include/linux/ceph/ceph_fs.h
> > @@ -530,6 +530,9 @@ struct ceph_mds_reply_lease {
> >   	__le32 seq;
> >   } __attribute__ ((packed));
> >   
> > +#define CEPH_LEASE_VALID        (1 | 2) /* old and new bit values */
> > +#define CEPH_LEASE_PRIMARY_LINK 4       /* primary linkage */
> > +
> >   struct ceph_mds_reply_dirfrag {
> >   	__le32 frag;            /* fragment */
> >   	__le32 auth;            /* auth mds, if this is a delegation point */
> > @@ -659,6 +662,12 @@ int ceph_flags_to_mode(int flags);
> >   #define CEPH_CAP_LOCKS (CEPH_LOCK_IFILE | CEPH_LOCK_IAUTH | CEPH_LOCK_ILINK | \
> >   			CEPH_LOCK_IXATTR)
> >   
> > +/* cap masks async dir operations */
> > +#define CEPH_CAP_DIR_CREATE	CEPH_CAP_FILE_CACHE
> > +#define CEPH_CAP_DIR_UNLINK	CEPH_CAP_FILE_RD
> > +#define CEPH_CAP_ANY_DIR_OPS	(CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_RD | \
> > +				 CEPH_CAP_FILE_WREXTEND | CEPH_CAP_FILE_LAZYIO)
> > +
> >   int ceph_caps_for_mode(int mode);
> >   
> >   enum {
> > 

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/6] ceph: hold extra reference to r_parent over life of request
  2020-01-09 11:20     ` Jeff Layton
@ 2020-01-09 13:16       ` Xiubo Li
  2020-01-09 13:33         ` Jeff Layton
  0 siblings, 1 reply; 12+ messages in thread
From: Xiubo Li @ 2020-01-09 13:16 UTC (permalink / raw)
  To: Jeff Layton, ceph-devel; +Cc: sage, idryomov, zyan, pdonnell

On 2020/1/9 19:20, Jeff Layton wrote:
> On Thu, 2020-01-09 at 10:05 +0800, Xiubo Li wrote:
>> On 2020/1/6 23:35, Jeff Layton wrote:
>>> Currently, we just assume that it will stick around by virtue of the
>>> submitter's reference, but later patches will allow the syscall to
>>> return early and we can't rely on that reference at that point.
>>>
>>> Take an extra reference to the inode when setting r_parent and release
>>> it when releasing the request.
>>>
>>> Signed-off-by: Jeff Layton <jlayton@kernel.org>
>>> ---
>>>    fs/ceph/mds_client.c | 8 ++++++--
>>>    1 file changed, 6 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>>> index 94cce2ab92c4..b7122f682678 100644
>>> --- a/fs/ceph/mds_client.c
>>> +++ b/fs/ceph/mds_client.c
>>> @@ -708,8 +708,10 @@ void ceph_mdsc_release_request(struct kref *kref)
>>>    		/* avoid calling iput_final() in mds dispatch threads */
>>>    		ceph_async_iput(req->r_inode);
>>>    	}
>>> -	if (req->r_parent)
>>> +	if (req->r_parent) {
>>>    		ceph_put_cap_refs(ceph_inode(req->r_parent), CEPH_CAP_PIN);
>>> +		ceph_async_iput(req->r_parent);
>>> +	}
>>>    	ceph_async_iput(req->r_target_inode);
>>>    	if (req->r_dentry)
>>>    		dput(req->r_dentry);
>>> @@ -2706,8 +2708,10 @@ int ceph_mdsc_submit_request(struct ceph_mds_client *mdsc, struct inode *dir,
>>>    	/* take CAP_PIN refs for r_inode, r_parent, r_old_dentry */
>>>    	if (req->r_inode)
>>>    		ceph_get_cap_refs(ceph_inode(req->r_inode), CEPH_CAP_PIN);
>>> -	if (req->r_parent)
>>> +	if (req->r_parent) {
>>>    		ceph_get_cap_refs(ceph_inode(req->r_parent), CEPH_CAP_PIN);
>>> +		ihold(req->r_parent);
>>> +	}
>> This might also fix another issue when the mdsc request is timedout and
>> returns to the vfs, then the r_parent maybe released in vfs. And then if
>> we reference it again in mdsc handle_reply() -->
>> ceph_mdsc_release_request(),  some unknown issues may happen later ??
>>
> AIUI, when a timeout occurs, the req is unhashed such that handle_reply
> can't find it. So, I doubt this affects that one way or another.

If my understanding is correct, such as for rmdir(), the logic will be :

req = ceph_mdsc_create_request()      //  ref == 1

ceph_mdsc_do_request(req) -->

         ceph_mdsc_submit_request(req) -->

                 __register_request(req) // ref == 2

         ceph_mdsc_wait_request(req)  // If timedout

ceph_mdsc_put_request(req)  // ref == 1

Then in handled_reply(), only when we get a safe reply it will call 
__unregister_request(req), then the ref could be 0.

Though it will ihold()/ceph_async_iput() the req->r_unsafe_dir(= 
req->r_parent) , but the _iput() will be called just before we reference 
the req->r_parent in the _relase_request(). And the _iput() here may 
call the iput_final().

BRs



>>>    	if (req->r_old_dentry_dir)
>>>    		ceph_get_cap_refs(ceph_inode(req->r_old_dentry_dir),
>>>    				  CEPH_CAP_PIN);
>>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/6] ceph: hold extra reference to r_parent over life of request
  2020-01-09 13:16       ` Xiubo Li
@ 2020-01-09 13:33         ` Jeff Layton
  2020-01-10  1:41           ` Xiubo Li
  0 siblings, 1 reply; 12+ messages in thread
From: Jeff Layton @ 2020-01-09 13:33 UTC (permalink / raw)
  To: Xiubo Li, ceph-devel; +Cc: sage, idryomov, zyan, pdonnell

On Thu, 2020-01-09 at 21:16 +0800, Xiubo Li wrote:
> On 2020/1/9 19:20, Jeff Layton wrote:
> > On Thu, 2020-01-09 at 10:05 +0800, Xiubo Li wrote:
> > > On 2020/1/6 23:35, Jeff Layton wrote:
> > > > Currently, we just assume that it will stick around by virtue of the
> > > > submitter's reference, but later patches will allow the syscall to
> > > > return early and we can't rely on that reference at that point.
> > > > 
> > > > Take an extra reference to the inode when setting r_parent and release
> > > > it when releasing the request.
> > > > 
> > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > ---
> > > >    fs/ceph/mds_client.c | 8 ++++++--
> > > >    1 file changed, 6 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > > > index 94cce2ab92c4..b7122f682678 100644
> > > > --- a/fs/ceph/mds_client.c
> > > > +++ b/fs/ceph/mds_client.c
> > > > @@ -708,8 +708,10 @@ void ceph_mdsc_release_request(struct kref *kref)
> > > >    		/* avoid calling iput_final() in mds dispatch threads */
> > > >    		ceph_async_iput(req->r_inode);
> > > >    	}
> > > > -	if (req->r_parent)
> > > > +	if (req->r_parent) {
> > > >    		ceph_put_cap_refs(ceph_inode(req->r_parent), CEPH_CAP_PIN);
> > > > +		ceph_async_iput(req->r_parent);
> > > > +	}
> > > >    	ceph_async_iput(req->r_target_inode);
> > > >    	if (req->r_dentry)
> > > >    		dput(req->r_dentry);
> > > > @@ -2706,8 +2708,10 @@ int ceph_mdsc_submit_request(struct ceph_mds_client *mdsc, struct inode *dir,
> > > >    	/* take CAP_PIN refs for r_inode, r_parent, r_old_dentry */
> > > >    	if (req->r_inode)
> > > >    		ceph_get_cap_refs(ceph_inode(req->r_inode), CEPH_CAP_PIN);
> > > > -	if (req->r_parent)
> > > > +	if (req->r_parent) {
> > > >    		ceph_get_cap_refs(ceph_inode(req->r_parent), CEPH_CAP_PIN);
> > > > +		ihold(req->r_parent);
> > > > +	}
> > > This might also fix another issue when the mdsc request is timedout and
> > > returns to the vfs, then the r_parent maybe released in vfs. And then if
> > > we reference it again in mdsc handle_reply() -->
> > > ceph_mdsc_release_request(),  some unknown issues may happen later ??
> > > 
> > AIUI, when a timeout occurs, the req is unhashed such that handle_reply
> > can't find it. So, I doubt this affects that one way or another.
> 
> If my understanding is correct, such as for rmdir(), the logic will be :
> 
> req = ceph_mdsc_create_request()      //  ref == 1
> 
> ceph_mdsc_do_request(req) -->
> 
>          ceph_mdsc_submit_request(req) -->
> 
>                  __register_request(req) // ref == 2
> 
>          ceph_mdsc_wait_request(req)  // If timedout
> 
> ceph_mdsc_put_request(req)  // ref == 1
> 
> Then in handled_reply(), only when we get a safe reply it will call 
> __unregister_request(req), then the ref could be 0.
> 
> Though it will ihold()/ceph_async_iput() the req->r_unsafe_dir(= 
> req->r_parent) , but the _iput() will be called just before we reference 
> the req->r_parent in the _relase_request(). And the _iput() here may 
> call the iput_final().
> 

I take it back, I think you're right. This likely would fix that issue
up. I'll plan to add a note about that to the changelog before I merge
it. Should we mark this for stable in light of that?
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/6] ceph: asynchronous unlink support
       [not found] <20200106153520.307523-1-jlayton@kernel.org>
       [not found] ` <20200106153520.307523-3-jlayton@kernel.org>
@ 2020-01-09 13:58 ` Yan, Zheng
       [not found] ` <20200106153520.307523-7-jlayton@kernel.org>
  2 siblings, 0 replies; 12+ messages in thread
From: Yan, Zheng @ 2020-01-09 13:58 UTC (permalink / raw)
  To: Jeff Layton, ceph-devel; +Cc: sage, idryomov, pdonnell

On 1/6/20 11:35 PM, Jeff Layton wrote:
> I sent an initial RFC set for this around 10 months ago. Since then,
> the requisite patches for the MDS have been merged for the octopus
> release. This adds support to the kclient to take advantage of
> asynchronous unlinks.
> 
> In earlier testing (with a vstart cluster backed by a rotating HDD), I
> saw roughly a 2x speedup when doing an rmdir on a directory with 10000
> files in it. When testing with a cluster backed by an NVMe SSD though,
> I only saw about a 20% speedup.
> 
> I'd like to put this in the testing branch now, so that it's ready for
> merge in the upcoming v5.6 merge window. Once this is in, asynchronous
> create support will be the next step.
> 
> Jeff Layton (4):
>    ceph: close holes in struct ceph_mds_session
>    ceph: hold extra reference to r_parent over life of request
>    ceph: register MDS request with dir inode from the start
>    ceph: add refcounting for Fx caps
> 
> Yan, Zheng (2):
>    ceph: check inode type for CEPH_CAP_FILE_{CACHE,RD,REXTEND,LAZYIO}
>    ceph: perform asynchronous unlink if we have sufficient caps
> 
>   fs/ceph/caps.c               | 84 ++++++++++++++++++++++++++----------
>   fs/ceph/dir.c                | 70 ++++++++++++++++++++++++++++--
>   fs/ceph/inode.c              |  9 +++-
>   fs/ceph/mds_client.c         | 27 ++++++------
>   fs/ceph/mds_client.h         |  2 +-
>   fs/ceph/super.c              |  4 ++
>   fs/ceph/super.h              | 17 +++-----
>   include/linux/ceph/ceph_fs.h |  9 ++++
>   8 files changed, 169 insertions(+), 53 deletions(-)
> 

Series Reviewed-by: "Yan, Zheng" <zyan@redhat.com>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 6/6] ceph: perform asynchronous unlink if we have sufficient caps
       [not found] ` <20200106153520.307523-7-jlayton@kernel.org>
  2020-01-09  9:18   ` [PATCH 6/6] ceph: perform asynchronous unlink if we have sufficient caps Yan, Zheng
@ 2020-01-09 14:49   ` Yan, Zheng
  2020-01-09 15:03     ` Jeff Layton
  1 sibling, 1 reply; 12+ messages in thread
From: Yan, Zheng @ 2020-01-09 14:49 UTC (permalink / raw)
  To: Jeff Layton, ceph-devel; +Cc: sage, idryomov, pdonnell

On 1/6/20 11:35 PM, Jeff Layton wrote:
> From: "Yan, Zheng" <zyan@redhat.com>
> 
> The MDS is getting a new lock-caching facility that will allow it
> to cache the necessary locks to allow asynchronous directory operations.
> Since the CEPH_CAP_FILE_* caps are currently unused on directories,
> we can repurpose those bits for this purpose.
> 
> When performing an unlink, if we have Fx on the parent directory,
> and CEPH_CAP_DIR_UNLINK (aka Fr), and we know that the dentry being
> removed is the primary link, then then we can fire off an unlink
> request immediately and don't need to wait on reply before returning.
> 
> In that situation, just fix up the dcache and link count and return
> immediately after issuing the call to the MDS. This does mean that we
> need to hold an extra reference to the inode being unlinked, and extra
> references to the caps to avoid races. Those references are put and
> error handling is done in the r_callback routine.
> 
> If the operation ends up failing, then set a writeback error on the
> directory inode that can be fetched later by an fsync on the dir.
> 
> Since this feature is very new, also add a new module parameter to
> enable and disable it (default is disabled).

author of this patch is not me :)

> 
> Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>   fs/ceph/caps.c               | 35 ++++++++++++------
>   fs/ceph/dir.c                | 70 +++++++++++++++++++++++++++++++++---
>   fs/ceph/inode.c              |  8 ++++-
>   fs/ceph/super.c              |  4 +++
>   fs/ceph/super.h              |  3 ++
>   include/linux/ceph/ceph_fs.h |  9 +++++
>   6 files changed, 113 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index d05717397c2a..7fc87b693ba4 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -992,7 +992,11 @@ int __ceph_caps_file_wanted(struct ceph_inode_info *ci)
>   int __ceph_caps_wanted(struct ceph_inode_info *ci)
>   {
>   	int w = __ceph_caps_file_wanted(ci) | __ceph_caps_used(ci);
> -	if (!S_ISDIR(ci->vfs_inode.i_mode)) {
> +	if (S_ISDIR(ci->vfs_inode.i_mode)) {
> +		/* we want EXCL if holding caps of dir ops */
> +		if (w & CEPH_CAP_ANY_DIR_OPS)
> +			w |= CEPH_CAP_FILE_EXCL;
> +	} else {
>   		/* we want EXCL if dirty data */
>   		if (w & CEPH_CAP_FILE_BUFFER)
>   			w |= CEPH_CAP_FILE_EXCL;
> @@ -1883,10 +1887,13 @@ void ceph_check_caps(struct ceph_inode_info *ci, int flags,
>   			 * revoking the shared cap on every create/unlink
>   			 * operation.
>   			 */
> -			if (IS_RDONLY(inode))
> +			if (IS_RDONLY(inode)) {
>   				want = CEPH_CAP_ANY_SHARED;
> -			else
> -				want = CEPH_CAP_ANY_SHARED | CEPH_CAP_FILE_EXCL;
> +			} else {
> +				want = CEPH_CAP_ANY_SHARED |
> +				       CEPH_CAP_FILE_EXCL |
> +				       CEPH_CAP_ANY_DIR_OPS;
> +			}
>   			retain |= want;
>   		} else {
>   
> @@ -2649,7 +2656,10 @@ static int try_get_cap_refs(struct inode *inode, int need, int want,
>   				}
>   				snap_rwsem_locked = true;
>   			}
> -			*got = need | (have & want);
> +			if ((have & want) == want)
> +				*got = need | want;
> +			else
> +				*got = need;
>   			if (S_ISREG(inode->i_mode) &&
>   			    (need & CEPH_CAP_FILE_RD) &&
>   			    !(*got & CEPH_CAP_FILE_CACHE))
> @@ -2739,13 +2749,16 @@ int ceph_try_get_caps(struct inode *inode, int need, int want,
>   	int ret;
>   
>   	BUG_ON(need & ~CEPH_CAP_FILE_RD);
> -	BUG_ON(want & ~(CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO|CEPH_CAP_FILE_SHARED));
> -	ret = ceph_pool_perm_check(inode, need);
> -	if (ret < 0)
> -		return ret;
> +	if (need) {
> +		ret = ceph_pool_perm_check(inode, need);
> +		if (ret < 0)
> +			return ret;
> +	}
>   
> -	ret = try_get_cap_refs(inode, need, want, 0,
> -			       (nonblock ? NON_BLOCKING : 0), got);
> +	BUG_ON(want & ~(CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_LAZYIO |
> +			CEPH_CAP_FILE_SHARED | CEPH_CAP_FILE_EXCL |
> +			CEPH_CAP_ANY_DIR_OPS));
> +	ret = try_get_cap_refs(inode, need, want, 0, nonblock, got);
>   	return ret == -EAGAIN ? 0 : ret;
>   }
>   
> diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
> index d0cd0aba5843..10294f07f5f0 100644
> --- a/fs/ceph/dir.c
> +++ b/fs/ceph/dir.c
> @@ -1036,6 +1036,47 @@ static int ceph_link(struct dentry *old_dentry, struct inode *dir,
>   	return err;
>   }
>   
> +static void ceph_async_unlink_cb(struct ceph_mds_client *mdsc,
> +				 struct ceph_mds_request *req)
> +{
> +	/* If op failed, set error on parent directory */
> +	mapping_set_error(req->r_parent->i_mapping, req->r_err);
> +	if (req->r_err)
> +		printk("%s: req->r_err = %d\n", __func__, req->r_err);
> +	ceph_put_cap_refs(ceph_inode(req->r_parent),
> +			  CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_UNLINK);
> +	iput(req->r_old_inode);
> +}
> +
> +static bool get_caps_for_async_unlink(struct inode *dir, struct dentry *dentry)
> +{
> +	struct ceph_inode_info *ci = ceph_inode(dir);
> +	struct ceph_dentry_info *di;
> +	int ret, want, got;
> +
> +	want = CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_UNLINK;
> +	ret = ceph_try_get_caps(dir, 0, want, true, &got);
> +	dout("Fx on %p ret=%d got=%d\n", dir, ret, got);
> +	if (ret != 1 || got != want)
> +		return false;
> +
> +        spin_lock(&dentry->d_lock);
> +        di = ceph_dentry(dentry);
> +	/* - We are holding CEPH_CAP_FILE_EXCL, which implies
> +	 * CEPH_CAP_FILE_SHARED.
> +	 * - Only support async unlink for primary linkage */
> +	if (atomic_read(&ci->i_shared_gen) != di->lease_shared_gen ||
> +	    !(di->flags & CEPH_DENTRY_PRIMARY_LINK))
> +		ret = 0;
> +        spin_unlock(&dentry->d_lock);
> +
> +	if (!ret) {
> +		ceph_put_cap_refs(ci, got);
> +		return false;
> +	}
> +	return true;
> +}
> +
>   /*
>    * rmdir and unlink are differ only by the metadata op code
>    */
> @@ -1067,13 +1108,33 @@ static int ceph_unlink(struct inode *dir, struct dentry *dentry)
>   	req->r_dentry = dget(dentry);
>   	req->r_num_caps = 2;
>   	req->r_parent = dir;
> -	set_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
>   	req->r_dentry_drop = CEPH_CAP_FILE_SHARED;
>   	req->r_dentry_unless = CEPH_CAP_FILE_EXCL;
>   	req->r_inode_drop = ceph_drop_caps_for_unlink(inode);
> -	err = ceph_mdsc_do_request(mdsc, dir, req);
> -	if (!err && !req->r_reply_info.head->is_dentry)
> -		d_delete(dentry);
> +
> +	if (enable_async_dirops && op == CEPH_MDS_OP_UNLINK &&
> +	    get_caps_for_async_unlink(dir, dentry)) {
> +		dout("ceph: Async unlink on %lu/%.*s", dir->i_ino,
> +		     dentry->d_name.len, dentry->d_name.name);
> +		req->r_callback = ceph_async_unlink_cb;
> +		req->r_old_inode = d_inode(dentry);
> +		ihold(req->r_old_inode);
> +		err = ceph_mdsc_submit_request(mdsc, dir, req);
> +		if (!err) {
> +			/*
> +			 * We have enough caps, so we assume that the unlink
> +			 * will succeed. Fix up the target inode and dcache.
> +			 */
> +			drop_nlink(inode);
> +			d_delete(dentry);
> +		}
> +	} else {
> +		set_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
> +		err = ceph_mdsc_do_request(mdsc, dir, req);
> +		if (!err && !req->r_reply_info.head->is_dentry)
> +			d_delete(dentry);
> +	}
> +
>   	ceph_mdsc_put_request(req);
>   out:
>   	return err;
> @@ -1411,6 +1472,7 @@ void ceph_invalidate_dentry_lease(struct dentry *dentry)
>   	spin_lock(&dentry->d_lock);
>   	di->time = jiffies;
>   	di->lease_shared_gen = 0;
> +	di->flags &= ~CEPH_DENTRY_PRIMARY_LINK;
>   	__dentry_lease_unlist(di);
>   	spin_unlock(&dentry->d_lock);
>   }
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index ef9e8281d485..ffef475af72b 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -1048,6 +1048,7 @@ static void __update_dentry_lease(struct inode *dir, struct dentry *dentry,
>   				  struct ceph_mds_session **old_lease_session)
>   {
>   	struct ceph_dentry_info *di = ceph_dentry(dentry);
> +	unsigned mask = le16_to_cpu(lease->mask);
>   	long unsigned duration = le32_to_cpu(lease->duration_ms);
>   	long unsigned ttl = from_time + (duration * HZ) / 1000;
>   	long unsigned half_ttl = from_time + (duration * HZ / 2) / 1000;
> @@ -1059,8 +1060,13 @@ static void __update_dentry_lease(struct inode *dir, struct dentry *dentry,
>   	if (ceph_snap(dir) != CEPH_NOSNAP)
>   		return;
>   
> +	if (mask & CEPH_LEASE_PRIMARY_LINK)
> +		di->flags |= CEPH_DENTRY_PRIMARY_LINK;
> +	else
> +		di->flags &= ~CEPH_DENTRY_PRIMARY_LINK;
> +
>   	di->lease_shared_gen = atomic_read(&ceph_inode(dir)->i_shared_gen);
> -	if (duration == 0) {
> +	if (!(mask & CEPH_LEASE_VALID)) {
>   		__ceph_dentry_dir_lease_touch(di);
>   		return;
>   	}
> diff --git a/fs/ceph/super.c b/fs/ceph/super.c
> index 112927dbd2f2..c149edb6aa2d 100644
> --- a/fs/ceph/super.c
> +++ b/fs/ceph/super.c
> @@ -1324,6 +1324,10 @@ static void __exit exit_ceph(void)
>   	destroy_caches();
>   }
>   
> +bool enable_async_dirops;
> +module_param(enable_async_dirops, bool, 0644);
> +MODULE_PARM_DESC(enable_async_dirops, "Asynchronous directory operations enabled");
> +
>   module_init(init_ceph);
>   module_exit(exit_ceph);
>   
> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> index 4b53f32b9324..4083de451710 100644
> --- a/fs/ceph/super.h
> +++ b/fs/ceph/super.h
> @@ -72,6 +72,8 @@
>   #define CEPH_CAPS_WANTED_DELAY_MIN_DEFAULT      5  /* cap release delay */
>   #define CEPH_CAPS_WANTED_DELAY_MAX_DEFAULT     60  /* cap release delay */
>   
> +extern bool enable_async_dirops;
> +
>   struct ceph_mount_options {
>   	unsigned int flags;
>   
> @@ -282,6 +284,7 @@ struct ceph_dentry_info {
>   #define CEPH_DENTRY_REFERENCED		1
>   #define CEPH_DENTRY_LEASE_LIST		2
>   #define CEPH_DENTRY_SHRINK_LIST		4
> +#define CEPH_DENTRY_PRIMARY_LINK	8
>   
>   struct ceph_inode_xattrs_info {
>   	/*
> diff --git a/include/linux/ceph/ceph_fs.h b/include/linux/ceph/ceph_fs.h
> index cb21c5cf12c3..a099f60feb7b 100644
> --- a/include/linux/ceph/ceph_fs.h
> +++ b/include/linux/ceph/ceph_fs.h
> @@ -530,6 +530,9 @@ struct ceph_mds_reply_lease {
>   	__le32 seq;
>   } __attribute__ ((packed));
>   
> +#define CEPH_LEASE_VALID        (1 | 2) /* old and new bit values */
> +#define CEPH_LEASE_PRIMARY_LINK 4       /* primary linkage */
> +
>   struct ceph_mds_reply_dirfrag {
>   	__le32 frag;            /* fragment */
>   	__le32 auth;            /* auth mds, if this is a delegation point */
> @@ -659,6 +662,12 @@ int ceph_flags_to_mode(int flags);
>   #define CEPH_CAP_LOCKS (CEPH_LOCK_IFILE | CEPH_LOCK_IAUTH | CEPH_LOCK_ILINK | \
>   			CEPH_LOCK_IXATTR)
>   
> +/* cap masks async dir operations */
> +#define CEPH_CAP_DIR_CREATE	CEPH_CAP_FILE_CACHE
> +#define CEPH_CAP_DIR_UNLINK	CEPH_CAP_FILE_RD
> +#define CEPH_CAP_ANY_DIR_OPS	(CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_RD | \
> +				 CEPH_CAP_FILE_WREXTEND | CEPH_CAP_FILE_LAZYIO)
> +
>   int ceph_caps_for_mode(int mode);
>   
>   enum {
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 6/6] ceph: perform asynchronous unlink if we have sufficient caps
  2020-01-09 14:49   ` Yan, Zheng
@ 2020-01-09 15:03     ` Jeff Layton
  0 siblings, 0 replies; 12+ messages in thread
From: Jeff Layton @ 2020-01-09 15:03 UTC (permalink / raw)
  To: Yan, Zheng, ceph-devel; +Cc: sage, idryomov, pdonnell

On Thu, 2020-01-09 at 22:49 +0800, Yan, Zheng wrote:
> On 1/6/20 11:35 PM, Jeff Layton wrote:
> > From: "Yan, Zheng" <zyan@redhat.com>
> > 
> > The MDS is getting a new lock-caching facility that will allow it
> > to cache the necessary locks to allow asynchronous directory operations.
> > Since the CEPH_CAP_FILE_* caps are currently unused on directories,
> > we can repurpose those bits for this purpose.
> > 
> > When performing an unlink, if we have Fx on the parent directory,
> > and CEPH_CAP_DIR_UNLINK (aka Fr), and we know that the dentry being
> > removed is the primary link, then then we can fire off an unlink
> > request immediately and don't need to wait on reply before returning.
> > 
> > In that situation, just fix up the dcache and link count and return
> > immediately after issuing the call to the MDS. This does mean that we
> > need to hold an extra reference to the inode being unlinked, and extra
> > references to the caps to avoid races. Those references are put and
> > error handling is done in the r_callback routine.
> > 
> > If the operation ends up failing, then set a writeback error on the
> > directory inode that can be fetched later by an fsync on the dir.
> > 
> > Since this feature is very new, also add a new module parameter to
> > enable and disable it (default is disabled).
> 
> author of this patch is not me :)
> 

I did the initial one and then you did some heavy modification of it, so
really it's authored by both of us. But ok, I'll keep the authorship. :)


> > Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >   fs/ceph/caps.c               | 35 ++++++++++++------
> >   fs/ceph/dir.c                | 70 +++++++++++++++++++++++++++++++++---
> >   fs/ceph/inode.c              |  8 ++++-
> >   fs/ceph/super.c              |  4 +++
> >   fs/ceph/super.h              |  3 ++
> >   include/linux/ceph/ceph_fs.h |  9 +++++
> >   6 files changed, 113 insertions(+), 16 deletions(-)
> > 
> > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > index d05717397c2a..7fc87b693ba4 100644
> > --- a/fs/ceph/caps.c
> > +++ b/fs/ceph/caps.c
> > @@ -992,7 +992,11 @@ int __ceph_caps_file_wanted(struct ceph_inode_info *ci)
> >   int __ceph_caps_wanted(struct ceph_inode_info *ci)
> >   {
> >   	int w = __ceph_caps_file_wanted(ci) | __ceph_caps_used(ci);
> > -	if (!S_ISDIR(ci->vfs_inode.i_mode)) {
> > +	if (S_ISDIR(ci->vfs_inode.i_mode)) {
> > +		/* we want EXCL if holding caps of dir ops */
> > +		if (w & CEPH_CAP_ANY_DIR_OPS)
> > +			w |= CEPH_CAP_FILE_EXCL;
> > +	} else {
> >   		/* we want EXCL if dirty data */
> >   		if (w & CEPH_CAP_FILE_BUFFER)
> >   			w |= CEPH_CAP_FILE_EXCL;
> > @@ -1883,10 +1887,13 @@ void ceph_check_caps(struct ceph_inode_info *ci, int flags,
> >   			 * revoking the shared cap on every create/unlink
> >   			 * operation.
> >   			 */
> > -			if (IS_RDONLY(inode))
> > +			if (IS_RDONLY(inode)) {
> >   				want = CEPH_CAP_ANY_SHARED;
> > -			else
> > -				want = CEPH_CAP_ANY_SHARED | CEPH_CAP_FILE_EXCL;
> > +			} else {
> > +				want = CEPH_CAP_ANY_SHARED |
> > +				       CEPH_CAP_FILE_EXCL |
> > +				       CEPH_CAP_ANY_DIR_OPS;
> > +			}
> >   			retain |= want;
> >   		} else {
> >   
> > @@ -2649,7 +2656,10 @@ static int try_get_cap_refs(struct inode *inode, int need, int want,
> >   				}
> >   				snap_rwsem_locked = true;
> >   			}
> > -			*got = need | (have & want);
> > +			if ((have & want) == want)
> > +				*got = need | want;
> > +			else
> > +				*got = need;
> >   			if (S_ISREG(inode->i_mode) &&
> >   			    (need & CEPH_CAP_FILE_RD) &&
> >   			    !(*got & CEPH_CAP_FILE_CACHE))
> > @@ -2739,13 +2749,16 @@ int ceph_try_get_caps(struct inode *inode, int need, int want,
> >   	int ret;
> >   
> >   	BUG_ON(need & ~CEPH_CAP_FILE_RD);
> > -	BUG_ON(want & ~(CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO|CEPH_CAP_FILE_SHARED));
> > -	ret = ceph_pool_perm_check(inode, need);
> > -	if (ret < 0)
> > -		return ret;
> > +	if (need) {
> > +		ret = ceph_pool_perm_check(inode, need);
> > +		if (ret < 0)
> > +			return ret;
> > +	}
> >   
> > -	ret = try_get_cap_refs(inode, need, want, 0,
> > -			       (nonblock ? NON_BLOCKING : 0), got);
> > +	BUG_ON(want & ~(CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_LAZYIO |
> > +			CEPH_CAP_FILE_SHARED | CEPH_CAP_FILE_EXCL |
> > +			CEPH_CAP_ANY_DIR_OPS));
> > +	ret = try_get_cap_refs(inode, need, want, 0, nonblock, got);
> >   	return ret == -EAGAIN ? 0 : ret;
> >   }
> >   
> > diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
> > index d0cd0aba5843..10294f07f5f0 100644
> > --- a/fs/ceph/dir.c
> > +++ b/fs/ceph/dir.c
> > @@ -1036,6 +1036,47 @@ static int ceph_link(struct dentry *old_dentry, struct inode *dir,
> >   	return err;
> >   }
> >   
> > +static void ceph_async_unlink_cb(struct ceph_mds_client *mdsc,
> > +				 struct ceph_mds_request *req)
> > +{
> > +	/* If op failed, set error on parent directory */
> > +	mapping_set_error(req->r_parent->i_mapping, req->r_err);
> > +	if (req->r_err)
> > +		printk("%s: req->r_err = %d\n", __func__, req->r_err);
> > +	ceph_put_cap_refs(ceph_inode(req->r_parent),
> > +			  CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_UNLINK);
> > +	iput(req->r_old_inode);
> > +}
> > +
> > +static bool get_caps_for_async_unlink(struct inode *dir, struct dentry *dentry)
> > +{
> > +	struct ceph_inode_info *ci = ceph_inode(dir);
> > +	struct ceph_dentry_info *di;
> > +	int ret, want, got;
> > +
> > +	want = CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_UNLINK;
> > +	ret = ceph_try_get_caps(dir, 0, want, true, &got);
> > +	dout("Fx on %p ret=%d got=%d\n", dir, ret, got);
> > +	if (ret != 1 || got != want)
> > +		return false;
> > +
> > +        spin_lock(&dentry->d_lock);
> > +        di = ceph_dentry(dentry);
> > +	/* - We are holding CEPH_CAP_FILE_EXCL, which implies
> > +	 * CEPH_CAP_FILE_SHARED.
> > +	 * - Only support async unlink for primary linkage */
> > Well, you'd have to fix up the caller in the first patch to pass
> > some argument, and
> > +	if (atomic_read(&ci->i_shared_gen) != di->lease_shared_gen ||
> > +	    !(di->flags & CEPH_DENTRY_PRIMARY_LINK))
> > +		ret = 0;
> > +        spin_unlock(&dentry->d_lock);
> > +
> > +	if (!ret) {
> > +		ceph_put_cap_refs(ci, got);
> > +		return false;
> > +	}
> > +	return true;
> > +}
> > +
> >   /*
> >    * rmdir and unlink are differ only by the metadata op code
> >    */
> > @@ -1067,13 +1108,33 @@ static int ceph_unlink(struct inode *dir, struct dentry *dentry)
> >   	req->r_dentry = dget(dentry);
> >   	req->r_num_caps = 2;
> >   	req->r_parent = dir;
> > -	set_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
> >   	req->r_dentry_drop = CEPH_CAP_FILE_SHARED;
> >   	req->r_dentry_unless = CEPH_CAP_FILE_EXCL;
> >   	req->r_inode_drop = ceph_drop_caps_for_unlink(inode);
> > -	err = ceph_mdsc_do_request(mdsc, dir, req);
> > -	if (!err && !req->r_reply_info.head->is_dentry)
> > -		d_delete(dentry);
> > +
> > +	if (enable_async_dirops && op == CEPH_MDS_OP_UNLINK &&
> > +	    get_caps_for_async_unlink(dir, dentry)) {
> > +		dout("ceph: Async unlink on %lu/%.*s", dir->i_ino,
> > +		     dentry->d_name.len, dentry->d_name.name);
> > +		req->r_callback = ceph_async_unlink_cb;
> > +		req->r_old_inode = d_inode(dentry);
> > +		ihold(req->r_old_inode);
> > +		err = ceph_mdsc_submit_request(mdsc, dir, req);
> > +		if (!err) {
> > +			/*
> > +			 * We have enough caps, so we assume that the unlink
> > +			 * will succeed. Fix up the target inode and dcache.
> > +			 */
> > +			drop_nlink(inode);
> > +			d_delete(dentry);
> > +		}
> > +	} else {
> > +		set_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
> > +		err = ceph_mdsc_do_request(mdsc, dir, req);
> > +		if (!err && !req->r_reply_info.head->is_dentry)
> > +			d_delete(dentry);
> > +	}
> > +
> >   	ceph_mdsc_put_request(req);
> >   out:
> >   	return err;
> > @@ -1411,6 +1472,7 @@ void ceph_invalidate_dentry_lease(struct dentry *dentry)
> >   	spin_lock(&dentry->d_lock);
> >   	di->time = jiffies;
> >   	di->lease_shared_gen = 0;
> > +	di->flags &= ~CEPH_DENTRY_PRIMARY_LINK;
> >   	__dentry_lease_unlist(di);
> >   	spin_unlock(&dentry->d_lock);
> >   }
> > diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> > index ef9e8281d485..ffef475af72b 100644
> > --- a/fs/ceph/inode.c
> > +++ b/fs/ceph/inode.c
> > @@ -1048,6 +1048,7 @@ static void __update_dentry_lease(struct inode *dir, struct dentry *dentry,
> >   				  struct ceph_mds_session **old_lease_session)
> >   {
> >   	struct ceph_dentry_info *di = ceph_dentry(dentry);
> > +	unsigned mask = le16_to_cpu(lease->mask);
> >   	long unsigned duration = le32_to_cpu(lease->duration_ms);
> >   	long unsigned ttl = from_time + (duration * HZ) / 1000;
> >   	long unsigned half_ttl = from_time + (duration * HZ / 2) / 1000;
> > @@ -1059,8 +1060,13 @@ static void __update_dentry_lease(struct inode *dir, struct dentry *dentry,
> >   	if (ceph_snap(dir) != CEPH_NOSNAP)
> >   		return;
> >   
> > +	if (mask & CEPH_LEASE_PRIMARY_LINK)
> > +		di->flags |= CEPH_DENTRY_PRIMARY_LINK;
> > +	else
> > +		di->flags &= ~CEPH_DENTRY_PRIMARY_LINK;
> > +
> >   	di->lease_shared_gen = atomic_read(&ceph_inode(dir)->i_shared_gen);
> > -	if (duration == 0) {
> > +	if (!(mask & CEPH_LEASE_VALID)) {
> >   		__ceph_dentry_dir_lease_touch(di);
> >   		return;
> >   	}
> > diff --git a/fs/ceph/super.c b/fs/ceph/super.c
> > index 112927dbd2f2..c149edb6aa2d 100644
> > --- a/fs/ceph/super.c
> > +++ b/fs/ceph/super.c
> > @@ -1324,6 +1324,10 @@ static void __exit exit_ceph(void)
> >   	destroy_caches();
> >   }
> >   
> > +bool enable_async_dirops;
> > +module_param(enable_async_dirops, bool, 0644);
> > +MODULE_PARM_DESC(enable_async_dirops, "Asynchronous directory operations enabled");
> > +
> >   module_init(init_ceph);
> >   module_exit(exit_ceph);
> >   
> > diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> > index 4b53f32b9324..4083de451710 100644
> > --- a/fs/ceph/super.h
> > +++ b/fs/ceph/super.h
> > @@ -72,6 +72,8 @@
> >   #define CEPH_CAPS_WANTED_DELAY_MIN_DEFAULT      5  /* cap release delay */
> >   #define CEPH_CAPS_WANTED_DELAY_MAX_DEFAULT     60  /* cap release delay */
> >   
> > +extern bool enable_async_dirops;
> > +
> >   struct ceph_mount_options {
> >   	unsigned int flags;
> >   
> > @@ -282,6 +284,7 @@ struct ceph_dentry_info {
> >   #define CEPH_DENTRY_REFERENCED		1
> >   #define CEPH_DENTRY_LEASE_LIST		2
> >   #define CEPH_DENTRY_SHRINK_LIST		4
> > +#define CEPH_DENTRY_PRIMARY_LINK	8
> >   
> >   struct ceph_inode_xattrs_info {
> >   	/*
> > diff --git a/include/linux/ceph/ceph_fs.h b/include/linux/ceph/ceph_fs.h
> > index cb21c5cf12c3..a099f60feb7b 100644
> > --- a/include/linux/ceph/ceph_fs.h
> > +++ b/include/linux/ceph/ceph_fs.h
> > @@ -530,6 +530,9 @@ struct ceph_mds_reply_lease {
> >   	__le32 seq;
> >   } __attribute__ ((packed));
> >   
> > +#define CEPH_LEASE_VALID        (1 | 2) /* old and new bit values */
> > +#define CEPH_LEASE_PRIMARY_LINK 4       /* primary linkage */
> > +
> >   struct ceph_mds_reply_dirfrag {
> >   	__le32 frag;            /* fragment */
> >   	__le32 auth;            /* auth mds, if this is a delegation point */
> > @@ -659,6 +662,12 @@ int ceph_flags_to_mode(int flags);
> >   #define CEPH_CAP_LOCKS (CEPH_LOCK_IFILE | CEPH_LOCK_IAUTH | CEPH_LOCK_ILINK | \
> >   			CEPH_LOCK_IXATTR)
> >   
> > +/* cap masks async dir operations */
> > +#define CEPH_CAP_DIR_CREATE	CEPH_CAP_FILE_CACHE
> > +#define CEPH_CAP_DIR_UNLINK	CEPH_CAP_FILE_RD
> > +#define CEPH_CAP_ANY_DIR_OPS	(CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_RD | \
> > +				 CEPH_CAP_FILE_WREXTEND | CEPH_CAP_FILE_LAZYIO)
> > +
> >   int ceph_caps_for_mode(int mode);
> >   
> >   enum {
> > 

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 6/6] ceph: perform asynchronous unlink if we have sufficient caps
  2020-01-09 11:34     ` Jeff Layton
@ 2020-01-09 17:53       ` Patrick Donnelly
  2020-01-09 21:30         ` Jeff Layton
  0 siblings, 1 reply; 12+ messages in thread
From: Patrick Donnelly @ 2020-01-09 17:53 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Yan, Zheng, Ceph Development, Sage Weil, Ilya Dryomov

On Thu, Jan 9, 2020 at 3:34 AM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Thu, 2020-01-09 at 17:18 +0800, Yan, Zheng wrote:
> > > +bool enable_async_dirops;
> > > +module_param(enable_async_dirops, bool, 0644);
> > > +MODULE_PARM_DESC(enable_async_dirops, "Asynchronous directory operations enabled");
> > > +
> >
> > why not use mount option
> >
>
> I'm open to suggestions here.
>
> I mostly put this in originally to help facilitate performance testing.
> A module option makes it easy to change this at runtime (without having
> to remount or anything).
>
> That said, we probably _do_ want to have a way for users to enable or
> disable this feature. We'll probably want this disabled by default
> initially, but I can forsee that changing once get more confidence.
>
> Mount options are a bit of a pain to deal with over time. If the
> defaults change, we need to document that in the manpages and online
> documentation. If you put a mount option in the fstab, then you have to
> deal with breakage if you boot to an earlier kernel that doesn't support
> that option.
>
> My thinking is that we should just use a module option initially (for
> the really early adopters) and only convert that to a mount option as
> the need for it becomes clear.

Module option makes sense.

A mount option to disable async ops would also make sense. I do not
think the default behavior should be off-by-default.

(Someday perhaps the kernel will be smart enough to also digest the
ceph configs delivered through the monitors...)

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 6/6] ceph: perform asynchronous unlink if we have sufficient caps
  2020-01-09 17:53       ` Patrick Donnelly
@ 2020-01-09 21:30         ` Jeff Layton
  0 siblings, 0 replies; 12+ messages in thread
From: Jeff Layton @ 2020-01-09 21:30 UTC (permalink / raw)
  To: Patrick Donnelly; +Cc: Yan, Zheng, Ceph Development, Sage Weil, Ilya Dryomov

On Thu, 2020-01-09 at 09:53 -0800, Patrick Donnelly wrote:
> On Thu, Jan 9, 2020 at 3:34 AM Jeff Layton <jlayton@kernel.org> wrote:
> > On Thu, 2020-01-09 at 17:18 +0800, Yan, Zheng wrote:
> > > > +bool enable_async_dirops;
> > > > +module_param(enable_async_dirops, bool, 0644);
> > > > +MODULE_PARM_DESC(enable_async_dirops, "Asynchronous directory operations enabled");
> > > > +
> > > 
> > > why not use mount option
> > > 
> > 
> > I'm open to suggestions here.
> > 
> > I mostly put this in originally to help facilitate performance testing.
> > A module option makes it easy to change this at runtime (without having
> > to remount or anything).
> > 
> > That said, we probably _do_ want to have a way for users to enable or
> > disable this feature. We'll probably want this disabled by default
> > initially, but I can forsee that changing once get more confidence.
> > 
> > Mount options are a bit of a pain to deal with over time. If the
> > defaults change, we need to document that in the manpages and online
> > documentation. If you put a mount option in the fstab, then you have to
> > deal with breakage if you boot to an earlier kernel that doesn't support
> > that option.
> > 
> > My thinking is that we should just use a module option initially (for
> > the really early adopters) and only convert that to a mount option as
> > the need for it becomes clear.
> 
> Module option makes sense.
> 
> A mount option to disable async ops would also make sense. I do not
> think the default behavior should be off-by-default.
> 
> (Someday perhaps the kernel will be smart enough to also digest the
> ceph configs delivered through the monitors...)
> 

I do think it should be off for now until we have some early adopters
play with it, and make sure it doesn't fall down in unexpected ways.

Once we've had it in place for a while (a couple of kernel releases?),
and some success with it, I'd be open to flipping the default to on
though.
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/6] ceph: hold extra reference to r_parent over life of request
  2020-01-09 13:33         ` Jeff Layton
@ 2020-01-10  1:41           ` Xiubo Li
  0 siblings, 0 replies; 12+ messages in thread
From: Xiubo Li @ 2020-01-10  1:41 UTC (permalink / raw)
  To: Jeff Layton, ceph-devel; +Cc: sage, idryomov, zyan, pdonnell

On 2020/1/9 21:33, Jeff Layton wrote:
> On Thu, 2020-01-09 at 21:16 +0800, Xiubo Li wrote:
>> On 2020/1/9 19:20, Jeff Layton wrote:
>>> On Thu, 2020-01-09 at 10:05 +0800, Xiubo Li wrote:
>>>> On 2020/1/6 23:35, Jeff Layton wrote:
>>>>> Currently, we just assume that it will stick around by virtue of the
>>>>> submitter's reference, but later patches will allow the syscall to
>>>>> return early and we can't rely on that reference at that point.
>>>>>
>>>>> Take an extra reference to the inode when setting r_parent and release
>>>>> it when releasing the request.
>>>>>
>>>>> Signed-off-by: Jeff Layton <jlayton@kernel.org>
>>>>> ---
>>>>>     fs/ceph/mds_client.c | 8 ++++++--
>>>>>     1 file changed, 6 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>>>>> index 94cce2ab92c4..b7122f682678 100644
>>>>> --- a/fs/ceph/mds_client.c
>>>>> +++ b/fs/ceph/mds_client.c
>>>>> @@ -708,8 +708,10 @@ void ceph_mdsc_release_request(struct kref *kref)
>>>>>     		/* avoid calling iput_final() in mds dispatch threads */
>>>>>     		ceph_async_iput(req->r_inode);
>>>>>     	}
>>>>> -	if (req->r_parent)
>>>>> +	if (req->r_parent) {
>>>>>     		ceph_put_cap_refs(ceph_inode(req->r_parent), CEPH_CAP_PIN);
>>>>> +		ceph_async_iput(req->r_parent);
>>>>> +	}
>>>>>     	ceph_async_iput(req->r_target_inode);
>>>>>     	if (req->r_dentry)
>>>>>     		dput(req->r_dentry);
>>>>> @@ -2706,8 +2708,10 @@ int ceph_mdsc_submit_request(struct ceph_mds_client *mdsc, struct inode *dir,
>>>>>     	/* take CAP_PIN refs for r_inode, r_parent, r_old_dentry */
>>>>>     	if (req->r_inode)
>>>>>     		ceph_get_cap_refs(ceph_inode(req->r_inode), CEPH_CAP_PIN);
>>>>> -	if (req->r_parent)
>>>>> +	if (req->r_parent) {
>>>>>     		ceph_get_cap_refs(ceph_inode(req->r_parent), CEPH_CAP_PIN);
>>>>> +		ihold(req->r_parent);
>>>>> +	}
>>>> This might also fix another issue when the mdsc request is timedout and
>>>> returns to the vfs, then the r_parent maybe released in vfs. And then if
>>>> we reference it again in mdsc handle_reply() -->
>>>> ceph_mdsc_release_request(),  some unknown issues may happen later ??
>>>>
>>> AIUI, when a timeout occurs, the req is unhashed such that handle_reply
>>> can't find it. So, I doubt this affects that one way or another.
>> If my understanding is correct, such as for rmdir(), the logic will be :
>>
>> req = ceph_mdsc_create_request()      //  ref == 1
>>
>> ceph_mdsc_do_request(req) -->
>>
>>           ceph_mdsc_submit_request(req) -->
>>
>>                   __register_request(req) // ref == 2
>>
>>           ceph_mdsc_wait_request(req)  // If timedout
>>
>> ceph_mdsc_put_request(req)  // ref == 1
>>
>> Then in handled_reply(), only when we get a safe reply it will call
>> __unregister_request(req), then the ref could be 0.
>>
>> Though it will ihold()/ceph_async_iput() the req->r_unsafe_dir(=
>> req->r_parent) , but the _iput() will be called just before we reference
>> the req->r_parent in the _relase_request(). And the _iput() here may
>> call the iput_final().
>>
> I take it back, I think you're right. This likely would fix that issue
> up. I'll plan to add a note about that to the changelog before I merge
> it. Should we mark this for stable in light of that?

Yeah, right :-)

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2020-01-10  1:41 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20200106153520.307523-1-jlayton@kernel.org>
     [not found] ` <20200106153520.307523-3-jlayton@kernel.org>
2020-01-09  2:05   ` [PATCH 2/6] ceph: hold extra reference to r_parent over life of request Xiubo Li
2020-01-09 11:20     ` Jeff Layton
2020-01-09 13:16       ` Xiubo Li
2020-01-09 13:33         ` Jeff Layton
2020-01-10  1:41           ` Xiubo Li
2020-01-09 13:58 ` [PATCH 0/6] ceph: asynchronous unlink support Yan, Zheng
     [not found] ` <20200106153520.307523-7-jlayton@kernel.org>
2020-01-09  9:18   ` [PATCH 6/6] ceph: perform asynchronous unlink if we have sufficient caps Yan, Zheng
2020-01-09 11:34     ` Jeff Layton
2020-01-09 17:53       ` Patrick Donnelly
2020-01-09 21:30         ` Jeff Layton
2020-01-09 14:49   ` Yan, Zheng
2020-01-09 15:03     ` Jeff Layton

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.