Linux-Fsdevel Archive on
 help / color / Atom feed
From: Ira Weiny <>
Cc: Andrew Morton <>,, Al Viro <>,
	Christian Benvenuti <>,
	Christoph Hellwig <>,
	Christopher Lameter <>,
	Dan Williams <>,
	Dave Chinner <>,
	Dennis Dalessandro <>,
	Doug Ledford <>, Jan Kara <>,
	Jason Gunthorpe <>,
	Jerome Glisse <>,
	Matthew Wilcox <>,
	Michal Hocko <>,
	Mike Rapoport <>,
	Mike Marciniszyn <>,
	Ralph Campbell <>,
	Tom Talpey <>, LKML <>,, John Hubbard <>
Subject: Re: [PATCH 0/2] mm: put_user_page() call site conversion first
Date: Thu, 14 Feb 2019 16:23:12 -0800
Message-ID: <> (raw)
In-Reply-To: <>

On Thu, Feb 07, 2019 at 11:56:47PM -0800, wrote:
> From: John Hubbard <>
> Hi,
> It seems about time to post these initial patches: I think we have pretty
> good consensus on the concept and details of the put_user_pages() approach.
> Therefore, here are the first two patches, to get started on converting the
> get_user_pages() call sites to use put_user_page(), instead of put_page().
> This is in order to implement tracking of get_user_page() pages.
> A discussion of the overall problem is below.
> As mentioned in patch 0001, the steps are to fix the problem are:
> 1) Provide put_user_page*() routines, intended to be used
>    for releasing pages that were pinned via get_user_pages*().
> 2) Convert all of the call sites for get_user_pages*(), to
>    invoke put_user_page*(), instead of put_page(). This involves dozens of
>    call sites, and will take some time.
> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
>    implement tracking of these pages. This tracking will be separate from
>    the existing struct page refcounting.
> 4) Use the tracking and identification of these pages, to implement
>    special handling (especially in writeback paths) when the pages are
>    backed by a filesystem.
> This write up is lifted from the RFC v2 patchset cover letter [1]:
> Overview
> ========
> Some kernel components (file systems, device drivers) need to access
> memory that is specified via process virtual address. For a long time, the
> API to achieve that was get_user_pages ("GUP") and its variations. However,
> GUP has critical limitations that have been overlooked; in particular, GUP
> does not interact correctly with filesystems in all situations. That means
> that file-backed memory + GUP is a recipe for potential problems, some of
> which have already occurred in the field.
> GUP was first introduced for Direct IO (O_DIRECT), allowing filesystem code
> to get the struct page behind a virtual address and to let storage hardware
> perform a direct copy to or from that page. This is a short-lived access
> pattern, and as such, the window for a concurrent writeback of GUP'd page
> was small enough that there were not (we think) any reported problems.
> Also, userspace was expected to understand and accept that Direct IO was
> not synchronized with memory-mapped access to that data, nor with any
> process address space changes such as munmap(), mremap(), etc.
> Over the years, more GUP uses have appeared (virtualization, device
> drivers, RDMA) that can keep the pages they get via GUP for a long period
> of time (seconds, minutes, hours, days, ...). This long-term pinning makes
> an underlying design problem more obvious.
> In fact, there are a number of key problems inherent to GUP:
> Interactions with file systems
> ==============================
> File systems expect to be able to write back data, both to reclaim pages,
> and for data integrity. Allowing other hardware (NICs, GPUs, etc) to gain
> write access to the file memory pages means that such hardware can dirty
> the pages, without the filesystem being aware. This can, in some cases
> (depending on filesystem, filesystem options, block device, block device
> options, and other variables), lead to data corruption, and also to kernel
> bugs of the form:
>     kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
>     backtrace:
>         ext4_writepage
>         __writepage
>         write_cache_pages
>         ext4_writepages
>         do_writepages
>         __writeback_single_inode
>         writeback_sb_inodes
>         __writeback_inodes_wb
>         wb_writeback
>         wb_workfn
>         process_one_work
>         worker_thread
>         kthread
>         ret_from_fork
> ...which is due to the file system asserting that there are still buffer
> heads attached:
>         ({                                                      \
>                 BUG_ON(!PagePrivate(page));                     \
>                 ((struct buffer_head *)page_private(page));     \
>         })
> Dave Chinner's description of this is very clear:
>     "The fundamental issue is that ->page_mkwrite must be called on every
>     write access to a clean file backed page, not just the first one.
>     How long the GUP reference lasts is irrelevant, if the page is clean
>     and you need to dirty it, you must call ->page_mkwrite before it is
>     marked writeable and dirtied. Every. Time."
> This is just one symptom of the larger design problem: filesystems do not
> actually support get_user_pages() being called on their pages, and letting
> hardware write directly to those pages--even though that pattern has been
> going on since about 2005 or so.
> Long term GUP
> =============
> Long term GUP is an issue when FOLL_WRITE is specified to GUP (so, a
> writeable mapping is created), and the pages are file-backed. That can lead
> to filesystem corruption. What happens is that when a file-backed page is
> being written back, it is first mapped read-only in all of the CPU page
> tables; the file system then assumes that nobody can write to the page, and
> that the page content is therefore stable. Unfortunately, the GUP callers
> generally do not monitor changes to the CPU pages tables; they instead
> assume that the following pattern is safe (it's not):
>     get_user_pages()
>     Hardware can keep a reference to those pages for a very long time,
>     and write to it at any time. Because "hardware" here means "devices
>     that are not a CPU", this activity occurs without any interaction
>     with the kernel's file system code.
>     for each page
>         set_page_dirty
>         put_page()
> In fact, the GUP documentation even recommends that pattern.
> Anyway, the file system assumes that the page is stable (nothing is writing
> to the page), and that is a problem: stable page content is necessary for
> many filesystem actions during writeback, such as checksum, encryption,
> RAID striping, etc. Furthermore, filesystem features like COW (copy on
> write) or snapshot also rely on being able to use a new page for as memory
> for that memory range inside the file.
> Corruption during write back is clearly possible here. To solve that, one
> idea is to identify pages that have active GUP, so that we can use a bounce
> page to write stable data to the filesystem. The filesystem would work
> on the bounce page, while any of the active GUP might write to the
> original page. This would avoid the stable page violation problem, but note
> that it is only part of the overall solution, because other problems
> remain.
> Other filesystem features that need to replace the page with a new one can
> be inhibited for pages that are GUP-pinned. This will, however, alter and
> limit some of those filesystem features. The only fix for that would be to
> require GUP users to monitor and respond to CPU page table updates.
> Subsystems such as ODP and HMM do this, for example. This aspect of the
> problem is still under discussion.
> Direct IO
> =========
> Direct IO can cause corruption, if userspace does Direct-IO that writes to
> a range of virtual addresses that are mmap'd to a file.  The pages written
> to are file-backed pages that can be under write back, while the Direct IO
> is taking place.  Here, Direct IO races with a write back: it calls
> GUP before page_mkclean() has replaced the CPU pte with a read-only entry.
> The race window is pretty small, which is probably why years have gone by
> before we noticed this problem: Direct IO is generally very quick, and
> tends to finish up before the filesystem gets around to do anything with
> the page contents.  However, it's still a real problem.  The solution is
> to never let GUP return pages that are under write back, but instead,
> force GUP to take a write fault on those pages.  That way, GUP will
> properly synchronize with the active write back.  This does not change the
> required GUP behavior, it just avoids that race.
> [1]
> Cc: Christian Benvenuti <>
> Cc: Christoph Hellwig <>
> Cc: Christopher Lameter <>
> Cc: Dan Williams <>
> Cc: Dave Chinner <>
> Cc: Dennis Dalessandro <>
> Cc: Doug Ledford <>
> Cc: Jan Kara <>
> Cc: Jason Gunthorpe <>
> Cc: Jérôme Glisse <>
> Cc: Matthew Wilcox <>
> Cc: Michal Hocko <>
> Cc: Mike Rapoport <>
> Cc: Mike Marciniszyn <>
> Cc: Ralph Campbell <>
> Cc: Tom Talpey <>
> John Hubbard (2):
>   mm: introduce put_user_page*(), placeholder versions
>   infiniband/mm: convert put_page() to put_user_page*()

A bit late but, FWIW:

Reviewed-by: Ira Weiny <>

John these are the pages sitting in your gup_dma/first_steps branch here,

>  drivers/infiniband/core/umem.c              |  7 +-
>  drivers/infiniband/core/umem_odp.c          |  2 +-
>  drivers/infiniband/hw/hfi1/user_pages.c     | 11 +--
>  drivers/infiniband/hw/mthca/mthca_memfree.c |  6 +-
>  drivers/infiniband/hw/qib/qib_user_pages.c  | 11 +--
>  drivers/infiniband/hw/qib/qib_user_sdma.c   |  6 +-
>  drivers/infiniband/hw/usnic/usnic_uiom.c    |  7 +-
>  include/linux/mm.h                          | 24 ++++++
>  mm/swap.c                                   | 82 +++++++++++++++++++++
>  9 files changed, 129 insertions(+), 27 deletions(-)
> -- 
> 2.20.1

  parent reply index

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-02-08  7:56 john.hubbard
2019-02-08  7:56 ` [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions john.hubbard
2019-02-08 10:32   ` Mike Rapoport
2019-02-08 20:44     ` John Hubbard
2019-02-08  7:56 ` [PATCH 2/2] infiniband/mm: convert put_page() to put_user_page*() john.hubbard
2019-02-15  0:23 ` Ira Weiny [this message]
2019-02-15  0:54   ` [PATCH 0/2] mm: put_user_page() call site conversion first John Hubbard

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-Fsdevel Archive on

Archives are clonable:
	git clone --mirror linux-fsdevel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-fsdevel linux-fsdevel/ \
	public-inbox-index linux-fsdevel

Newsgroup available over NNTP:

AGPL code for this site: git clone public-inbox