Linux-NFS Archive on lore.kernel.org
 help / color / Atom feed
From: Greg KH <gregkh@linuxfoundation.org>
To: Max Kellermann <mk@cm4all.com>
Cc: zhangliguang@linux.alibaba.com, linux-fsdevel@vger.kernel.org,
	linux-nfs@vger.kernel.org, trond.myklebust@hammerspace.com,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] Revert "NFS: readdirplus optimization by cache mechanism" (memleak)
Date: Fri, 12 Jul 2019 17:28:13 +0200
Message-ID: <20190712152813.GB13940@kroah.com> (raw)
In-Reply-To: <20190712141806.3063-1-mk@cm4all.com>

On Fri, Jul 12, 2019 at 04:18:06PM +0200, Max Kellermann wrote:
> This reverts commit be4c2d4723a4a637f0d1b4f7c66447141a4b3564.
> 
> That commit caused a severe memory leak in nfs_readdir_make_qstr().
> 
> When listing a directory with more than 100 files (this is how many
> struct nfs_cache_array_entry elements fit in one 4kB page), all
> allocated file name strings past those 100 leak.
> 
> The root of the leakage is that those string pointers are managed in
> pages which are never linked into the page cache.
> 
> fs/nfs/dir.c puts pages into the page cache by calling
> read_cache_page(); the callback function nfs_readdir_filler() will
> then fill the given page struct which was passed to it, which is
> already linked in the page cache (by do_read_cache_page() calling
> add_to_page_cache_lru()).
> 
> Commit be4c2d4723a4 added another (local) array of allocated pages, to
> be filled with more data, instead of discarding excess items received
> from the NFS server.  Those additional pages can be used by the next
> nfs_readdir_filler() call (from within the same nfs_readdir() call).
> 
> The leak happens when some of those additional pages are never used
> (copied to the page cache using copy_highpage()).  The pages will be
> freed by nfs_readdir_free_pages(), but their contents will not.  The
> commit did not invoke nfs_readdir_clear_array() (and doing so would
> have been dangerous, because it did not track which of those pages
> were already copied to the page cache, risking double free bugs).
> 
> How to reproduce the leak:
> 
> - Use a kernel with CONFIG_SLUB_DEBUG_ON.
> 
> - Create a directory on a NFS mount with more than 100 files with
>   names long enough to use the "kmalloc-32" slab (so we can easily
>   look up the allocation counts):
> 
>   for i in `seq 110`; do touch ${i}_0123456789abcdef; done
> 
> - Drop all caches:
> 
>   echo 3 >/proc/sys/vm/drop_caches
> 
> - Check the allocation counter:
> 
>   grep nfs_readdir /sys/kernel/slab/kmalloc-32/alloc_calls
>   30564391 nfs_readdir_add_to_array+0x73/0xd0 age=534558/4791307/6540952 pid=370-1048386 cpus=0-47 nodes=0-1
> 
> - Request a directory listing and check the allocation counters again:
> 
>   ls
>   [...]
>   grep nfs_readdir /sys/kernel/slab/kmalloc-32/alloc_calls
>   30564511 nfs_readdir_add_to_array+0x73/0xd0 age=207/4792999/6542663 pid=370-1048386 cpus=0-47 nodes=0-1
> 
> There are now 120 new allocations.
> 
> - Drop all caches and check the counters again:
> 
>   echo 3 >/proc/sys/vm/drop_caches
>   grep nfs_readdir /sys/kernel/slab/kmalloc-32/alloc_calls
>   30564401 nfs_readdir_add_to_array+0x73/0xd0 age=735/4793524/6543176 pid=370-1048386 cpus=0-47 nodes=0-1
> 
> 110 allocations are gone, but 10 have leaked and will never be freed.
> 
> Unhelpfully, those allocations are explicitly excluded from KMEMLEAK,
> that's why my initial attempts with KMEMLEAK were not successful:
> 
> 	/*
> 	 * Avoid a kmemleak false positive. The pointer to the name is stored
> 	 * in a page cache page which kmemleak does not scan.
> 	 */
> 	kmemleak_not_leak(string->name);
> 
> It would be possible to solve this bug without reverting the whole
> commit:
> 
> - keep track of which pages were not used, and call
>   nfs_readdir_clear_array() on them, or
> - manually link those pages into the page cache
> 
> But for now I have decided to just revert the commit, because the real
> fix would require complex considerations, risking more dangerous
> (crash) bugs, which may seem unsuitable for the stable branches.
> 
> Signed-off-by: Max Kellermann <mk@cm4all.com>
> ---
>  fs/nfs/dir.c      | 90 ++++-------------------------------------------
>  fs/nfs/internal.h |  3 +-
>  2 files changed, 7 insertions(+), 86 deletions(-)

No cc: stable@vger on this patch to get it backported?

thanks,

greg k-h

  reply index

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-07-12 14:18 Max Kellermann
2019-07-12 15:28 ` Greg KH [this message]
2019-07-12 20:16   ` Trond Myklebust

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190712152813.GB13940@kroah.com \
    --to=gregkh@linuxfoundation.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=mk@cm4all.com \
    --cc=trond.myklebust@hammerspace.com \
    --cc=zhangliguang@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-NFS Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-nfs/0 linux-nfs/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-nfs linux-nfs/ https://lore.kernel.org/linux-nfs \
		linux-nfs@vger.kernel.org linux-nfs@archiver.kernel.org
	public-inbox-index linux-nfs


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-nfs


AGPL code for this site: git clone https://public-inbox.org/ public-inbox