From: James Simmons <jsimmons@infradead.org>
To: Andreas Dilger <adilger@whamcloud.com>,
	Oleg Drokin <green@whamcloud.com>, NeilBrown <neilb@suse.de>
Cc: Wang Shilong <wshilong@ddn.com>,
	Lustre Development List <lustre-devel@lists.lustre.org>
Subject: [lustre-devel] [PATCH 02/41] lustre: llite: make readahead aware of hints
Date: Sun,  4 Apr 2021 20:50:31 -0400
Message-ID: <1617583870-32029-3-git-send-email-jsimmons@infradead.org>
In-Reply-To: <1617583870-32029-1-git-send-email-jsimmons@infradead.org>

From: Wang Shilong <wshilong@ddn.com>

Calling madvise(MADV_SEQUENTIAL) and madvise(MADV_RANDOM) sets the
VM_SEQ_READ and VM_RAND_READ hints in vma->vm_flags.  These should
be used to guide the Lustre readahead for better performance.
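
For reference, a minimal userspace sketch of how an application would
request these hints.  The helper name map_with_hint() is hypothetical
and not part of this patch; it only uses the standard open(2), mmap(2)
and madvise(2) calls:

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: map a file read-only and attach an access-pattern
 * hint to the mapping.  advice is MADV_SEQUENTIAL or MADV_RANDOM; per the
 * commit message these end up as VM_SEQ_READ / VM_RAND_READ in
 * vma->vm_flags.  Returns the mapping (length in *len_out), or MAP_FAILED.
 */
static void *map_with_hint(const char *path, int advice, size_t *len_out)
{
	struct stat st;
	void *addr;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return MAP_FAILED;

	/* mmap() rejects zero-length mappings, so bail out early */
	if (fstat(fd, &st) < 0 || st.st_size == 0) {
		close(fd);
		return MAP_FAILED;
	}

	addr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	close(fd);	/* the mapping keeps its own reference to the file */
	if (addr == MAP_FAILED)
		return MAP_FAILED;

	if (madvise(addr, (size_t)st.st_size, advice) < 0) {
		munmap(addr, (size_t)st.st_size);
		return MAP_FAILED;
	}

	*len_out = (size_t)st.st_size;
	return addr;
}
```

A caller would pass MADV_SEQUENTIAL before streaming through the
mapping, or MADV_RANDOM before sparse lookups, and munmap() the region
when done.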

Disable the kernel readahead for mmap() pages and use the llite
readahead instead.  Also fix a bug in __ll_fault() that left both
VM_SEQ_READ and VM_RAND_READ set at the same time, which broke
detection of the VM_SEQ_READ case because VM_RAND_READ was checked
first.
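
The flag handling described above can be sketched in standalone C.
This is an illustration only: the SK_* names and bit values are
stand-ins rather than the kernel's definitions, and sk_old_fault_flags()
only mimics the save/force/restore sequence that this patch removes
from ll_fault_io_init() and __ll_fault():

```c
/* Stand-in flag bits for the sketch; not the kernel's VM_* values. */
#define SK_SEQ_READ	0x1UL
#define SK_RAND_READ	0x2UL

enum sk_hint { SK_HINT_NONE, SK_HINT_SEQ, SK_HINT_RAND };

/*
 * Pre-patch behaviour around one fault: save the hint bits, force
 * "random" so the kernel does no readahead, then OR the saved bits back.
 * Nothing clears the forced RAND bit, so a sequential mapping ends up
 * with both bits set.
 */
static unsigned long sk_old_fault_flags(unsigned long vm_flags)
{
	unsigned long saved = vm_flags & (SK_RAND_READ | SK_SEQ_READ);

	vm_flags &= ~SK_SEQ_READ;
	vm_flags |= SK_RAND_READ;
	/* ... the fault would be handled here ... */
	vm_flags |= saved;

	return vm_flags;
}

/*
 * Patched selection order: sequential is tested first, and the VMA
 * flags are no longer modified, so the hints stay mutually exclusive.
 */
static enum sk_hint sk_pick_hint(unsigned long vm_flags)
{
	if (vm_flags & SK_SEQ_READ)
		return SK_HINT_SEQ;
	if (vm_flags & SK_RAND_READ)
		return SK_HINT_RAND;
	return SK_HINT_NONE;
}
```

With these definitions, sk_old_fault_flags(SK_SEQ_READ) yields both
bits, which a RAND-first check would misclassify as random access.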

This changes the readahead for mmap from submitting mostly 4KB RPCs
to a large number of 1MB RPCs for the application profiled:

  llite.*.read_ahead_stats     before        patched
  ------------------------     ------        -------
  hits                           2408         135924 samples [pages]
  misses                        34160           2384 samples [pages]

  osc.*.rpc_stats           read before    read patched
  ---------------          -------------  --------------
  pages per rpc            rpcs   % cum%   rpcs   % cum%
     1:                    6542  95  95     351  55  55
     2:                     224   3  99      76  12  67
     4:                      32   0  99      28   4  72
     8:                       2   0  99       9   1  73
    16:                      25   0  99      32   5  78
    32:                       0   0  99       8   1  80
    64:                       0   0  99       5   0  80
   128:                       0   0  99      15   2  83
   256:                       2   0  99     102  16  99
   512:                       0   0  99       0   0  99
  1024:                       1   0 100       3   0 100

Readahead hit rate improved from 6% to 98%.  4KB RPCs dropped from
95% to 55%, and 1MB+ RPCs increased from 0% to 16% (79% of all pages).
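
The percentages above can be reproduced from the tables with a short
calculation.  The sketch below is not part of the patch; the helper
names are hypothetical, and it assumes every RPC in a bucket carries
exactly that bucket's page count, which is what makes the 256-page
bucket come out at roughly 79% of all pages:

```c
#include <stddef.h>

/* Hit rate in percent from the read_ahead_stats hits/misses counters. */
static double hit_rate_pct(unsigned long hits, unsigned long misses)
{
	unsigned long total = hits + misses;

	return total ? 100.0 * (double)hits / (double)total : 0.0;
}

struct rpc_bucket {
	unsigned int pages;	/* "pages per rpc" bucket */
	unsigned int rpcs;	/* number of read RPCs in the bucket */
};

/* "read patched" column of the rpc_stats table above. */
static const struct rpc_bucket patched_read_rpcs[] = {
	{ 1, 351 }, { 2, 76 }, { 4, 28 }, { 8, 9 }, { 16, 32 }, { 32, 8 },
	{ 64, 5 }, { 128, 15 }, { 256, 102 }, { 512, 0 }, { 1024, 3 },
};

/*
 * Share of all pages, in percent, moved by RPCs in the bucket with the
 * given page count.  Assumes each RPC carries exactly "pages" pages.
 */
static double bucket_page_share_pct(const struct rpc_bucket *b, size_t n,
				    unsigned int pages)
{
	unsigned long total = 0, match = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		unsigned long moved = (unsigned long)b[i].pages * b[i].rpcs;

		total += moved;
		if (b[i].pages == pages)
			match += moved;
	}

	return total ? 100.0 * (double)match / (double)total : 0.0;
}
```

Feeding in the counters from the tables (2408/34160 before,
135924/2384 patched, and the 256-page bucket) gives values consistent
with the 6%, 98% and 79% figures quoted above.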

Add debug messages to ll_file_mmap(), ll_fault() and
ll_fault_io_init() to allow tracing VMA state for future IO
optimizations.

WC-bug-id: https://jira.whamcloud.com/browse/LU-13669
Lustre-commit: 7542820698696ed ("LU-13669 llite: make readahead aware of hints")
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-on: https://review.whamcloud.com/41228
Reviewed-by: Wang Shilong <wshilong@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/cl_object.h | 10 +++++++++-
 fs/lustre/llite/file.c        |  2 ++
 fs/lustre/llite/llite_mmap.c  | 42 ++++++++++++++++++++++--------------------
 fs/lustre/llite/rw.c          | 20 ++++++++++++++++----
 4 files changed, 49 insertions(+), 25 deletions(-)

diff --git a/fs/lustre/include/cl_object.h b/fs/lustre/include/cl_object.h
index 4f34e5d..739fe5b 100644
--- a/fs/lustre/include/cl_object.h
+++ b/fs/lustre/include/cl_object.h
@@ -1974,7 +1974,15 @@ struct cl_io {
 	 * the read IO will check to-be-read OSCs' status, and make fast-switch
 	 * another mirror if some of the OSTs are not healthy.
 	 */
-				ci_tried_all_mirrors:1;
+				ci_tried_all_mirrors:1,
+	/**
+	 * Random read hints, readahead will be disabled.
+	 */
+				ci_rand_read:1,
+	/**
+	 * Sequential read hints.
+	 */
+				ci_seq_read:1;
 	/**
 	 * Bypass quota check
 	 */
diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index 7c7ac01..fd01e14 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -736,6 +736,8 @@ static int ll_local_open(struct file *file, struct lookup_intent *it,
 	file->private_data = fd;
 	ll_readahead_init(inode, &fd->fd_ras);
 	fd->fd_omode = it->it_flags & (FMODE_READ | FMODE_WRITE | FMODE_EXEC);
+	/* turn off the kernel's read-ahead */
+	file->f_ra.ra_pages = 0;
 
 	/* ll_cl_context initialize */
 	rwlock_init(&fd->fd_lock);
diff --git a/fs/lustre/llite/llite_mmap.c b/fs/lustre/llite/llite_mmap.c
index f0be7ba..b9a73e0 100644
--- a/fs/lustre/llite/llite_mmap.c
+++ b/fs/lustre/llite/llite_mmap.c
@@ -84,13 +84,11 @@ struct vm_area_struct *our_vma(struct mm_struct *mm, unsigned long addr,
  * @vma		virtual memory area addressed to page fault
  * @env		corespondent lu_env to processing
  * @index	page index corespondent to fault.
- * @ra_flags	vma readahead flags.
  *
- * \return error codes from cl_io_init.
+ * RETURN	error codes from cl_io_init.
  */
 static struct cl_io *
-ll_fault_io_init(struct lu_env *env, struct vm_area_struct *vma,
-		 pgoff_t index, unsigned long *ra_flags)
+ll_fault_io_init(struct lu_env *env, struct vm_area_struct *vma, pgoff_t index)
 {
 	struct file *file = vma->vm_file;
 	struct inode *inode = file_inode(file);
@@ -110,18 +108,15 @@ struct vm_area_struct *our_vma(struct mm_struct *mm, unsigned long addr,
 	fio->ft_index = index;
 	fio->ft_executable = vma->vm_flags & VM_EXEC;
 
-	/*
-	 * disable VM_SEQ_READ and use VM_RAND_READ to make sure that
-	 * the kernel will not read other pages not covered by ldlm in
-	 * filemap_nopage. we do our readahead in ll_readpage.
-	 */
-	if (ra_flags)
-		*ra_flags = vma->vm_flags & (VM_RAND_READ | VM_SEQ_READ);
-	vma->vm_flags &= ~VM_SEQ_READ;
-	vma->vm_flags |= VM_RAND_READ;
+	CDEBUG(D_MMAP,
+	       DFID": vma=%p start=%#lx end=%#lx vm_flags=%#lx idx=%lu\n",
+	       PFID(&ll_i2info(inode)->lli_fid), vma, vma->vm_start,
+	       vma->vm_end, vma->vm_flags, fio->ft_index);
 
-	CDEBUG(D_MMAP, "vm_flags: %lx (%lu %d)\n", vma->vm_flags,
-	       fio->ft_index, fio->ft_executable);
+	if (vma->vm_flags & VM_SEQ_READ)
+		io->ci_seq_read = 1;
+	else if (vma->vm_flags & VM_RAND_READ)
+		io->ci_rand_read = 1;
 
 	rc = cl_io_init(env, io, CIT_FAULT, io->ci_obj);
 	if (rc == 0) {
@@ -161,7 +156,7 @@ static int __ll_page_mkwrite(struct vm_area_struct *vma, struct page *vmpage,
 	if (IS_ERR(env))
 		return PTR_ERR(env);
 
-	io = ll_fault_io_init(env, vma, vmpage->index, NULL);
+	io = ll_fault_io_init(env, vma, vmpage->index);
 	if (IS_ERR(io)) {
 		result = PTR_ERR(io);
 		goto out;
@@ -277,7 +272,6 @@ static vm_fault_t __ll_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	struct cl_io *io;
 	struct vvp_io *vio = NULL;
 	struct page *vmpage;
-	unsigned long ra_flags;
 	int result = 0;
 	vm_fault_t fault_ret = 0;
 	u16 refcheck;
@@ -314,7 +308,7 @@ static vm_fault_t __ll_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 		fault_ret = 0;
 	}
 
-	io = ll_fault_io_init(env, vma, vmf->pgoff, &ra_flags);
+	io = ll_fault_io_init(env, vma, vmf->pgoff);
 	if (IS_ERR(io)) {
 		fault_ret = to_fault_error(PTR_ERR(io));
 		goto out;
@@ -350,8 +344,6 @@ static vm_fault_t __ll_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	}
 	cl_io_fini(env, io);
 
-	vma->vm_flags |= ra_flags;
-
 out:
 	cl_env_put(env, &refcheck);
 	if (result != 0 && !(fault_ret & VM_FAULT_RETRY))
@@ -375,6 +367,10 @@ static vm_fault_t ll_fault(struct vm_fault *vmf)
 	if (cached)
 		goto out;
 
+	CDEBUG(D_MMAP, DFID": vma=%p start=%#lx end=%#lx vm_flags=%#lx\n",
+	       PFID(&ll_i2info(file_inode(vma->vm_file))->lli_fid),
+	       vma, vma->vm_start, vma->vm_end, vma->vm_flags);
+
 	/* Only SIGKILL and SIGTERM are allowed for fault/nopage/mkwrite
 	 * so that it can be killed by admin but not cause segfault by
 	 * other signals.
@@ -385,6 +381,7 @@ static vm_fault_t ll_fault(struct vm_fault *vmf)
 	/* make sure offset is not a negative number */
 	if (vmf->pgoff > (MAX_LFS_FILESIZE >> PAGE_SHIFT))
 		return VM_FAULT_SIGBUS;
+
 restart:
 	result = __ll_fault(vmf->vma, vmf);
 	if (vmf->page &&
@@ -545,6 +542,11 @@ int ll_file_mmap(struct file *file, struct vm_area_struct *vma)
 	bool cached;
 	int rc;
 
+	CDEBUG(D_VFSTRACE | D_MMAP,
+	       "VFS_Op: fid="DFID" vma=%p start=%#lx end=%#lx vm_flags=%#lx\n",
+	       PFID(&ll_i2info(inode)->lli_fid),
+	       vma, vma->vm_start, vma->vm_end, vma->vm_flags);
+
 	if (ll_file_nolock(file))
 		return -EOPNOTSUPP;
 
diff --git a/fs/lustre/llite/rw.c b/fs/lustre/llite/rw.c
index 096e015..8bba97f 100644
--- a/fs/lustre/llite/rw.c
+++ b/fs/lustre/llite/rw.c
@@ -1255,7 +1255,7 @@ static bool index_in_stride_window(struct ll_readahead_state *ras,
  */
 static void ras_update(struct ll_sb_info *sbi, struct inode *inode,
 		       struct ll_readahead_state *ras, pgoff_t index,
-		       enum ras_update_flags flags)
+		       enum ras_update_flags flags, struct cl_io *io)
 {
 	struct ll_ra_info *ra = &sbi->ll_ra_info;
 	bool hit = flags & LL_RAS_HIT;
@@ -1276,6 +1276,18 @@ static void ras_update(struct ll_sb_info *sbi, struct inode *inode,
 	if (ras->ras_no_miss_check)
 		goto out_unlock;
 
+	if (io && io->ci_rand_read)
+		goto out_unlock;
+
+	if (io && io->ci_seq_read) {
+		if (!hit) {
+			/* to avoid many small read RPC here */
+			ras->ras_window_pages = sbi->ll_ra_info.ra_range_pages;
+			ll_ra_stats_inc_sbi(sbi, RA_STAT_MMAP_RANGE_READ);
+		}
+		goto skip;
+	}
+
 	if (flags & LL_RAS_MMAP) {
 		unsigned long ra_pages;
 
@@ -1594,7 +1606,7 @@ int ll_io_read_page(const struct lu_env *env, struct cl_io *io,
 			flags |= LL_RAS_HIT;
 		if (!vio->vui_ra_valid)
 			flags |= LL_RAS_MMAP;
-		ras_update(sbi, inode, ras, vvp_index(vpg), flags);
+		ras_update(sbi, inode, ras, vvp_index(vpg), flags, io);
 	}
 
 	cl_2queue_init(queue);
@@ -1613,7 +1625,7 @@ int ll_io_read_page(const struct lu_env *env, struct cl_io *io,
 	io_start_index = cl_index(io->ci_obj, io->u.ci_rw.crw_pos);
 	io_end_index = cl_index(io->ci_obj, io->u.ci_rw.crw_pos +
 				io->u.ci_rw.crw_count - 1);
-	if (ll_readahead_enabled(sbi) && ras) {
+	if (ll_readahead_enabled(sbi) && ras && !io->ci_rand_read) {
 		pgoff_t skip_index = 0;
 
 		if (ras->ras_next_readahead_idx < vvp_index(vpg))
@@ -1802,7 +1814,7 @@ int ll_readpage(struct file *file, struct page *vmpage)
 			 * if the page is hit in cache because non cache page
 			 * case will be handled by slow read later.
 			 */
-			ras_update(sbi, inode, ras, vvp_index(vpg), flags);
+			ras_update(sbi, inode, ras, vvp_index(vpg), flags, io);
 			/* avoid duplicate ras_update() call */
 			vpg->vpg_ra_updated = 1;
 
-- 
1.8.3.1
