linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/2] docs: Add some struct file refcounting information
@ 2019-02-21 19:02 Jonathan Corbet
  2019-02-21 19:02 ` [PATCH 1/2] docs: Bring some order to filesystem documentation Jonathan Corbet
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Jonathan Corbet @ 2019-02-21 19:02 UTC
  To: linux-doc; +Cc: linux-kernel, Al Viro, axboe, Jonathan Corbet

Al Viro's recent "braindump" on how struct file refcounting works is too
good to lose, so let's capture it in Documentation/.  There are two changes
here:

 - Clean up the filesystem book, move stuff around, and generally try to
   create some order here.  No text changes other than some connecting
   text.

 - Bring in Al's text, with a fair amount of language tweaking and later
   corrections applied.

Hopefully this is just the beginning.

Jonathan Corbet (2):
  docs: Bring some order to filesystem documentation
  docs: Add struct file refcounting and SCM_RIGHTS mess info

 Documentation/filesystems/api-summary.rst | 150 ++++++++
 Documentation/filesystems/index.rst       | 395 ++--------------------
 Documentation/filesystems/journalling.rst | 184 ++++++++++
 Documentation/filesystems/lifecycles.rst  | 357 +++++++++++++++++++
 Documentation/filesystems/path-lookup.rst |  15 +
 Documentation/filesystems/splice.rst      |  22 ++
 6 files changed, 753 insertions(+), 370 deletions(-)
 create mode 100644 Documentation/filesystems/api-summary.rst
 create mode 100644 Documentation/filesystems/journalling.rst
 create mode 100644 Documentation/filesystems/lifecycles.rst
 create mode 100644 Documentation/filesystems/splice.rst

-- 
2.20.1



* [PATCH 1/2] docs: Bring some order to filesystem documentation
  2019-02-21 19:02 [PATCH 0/2] docs: Add some struct file refcounting information Jonathan Corbet
@ 2019-02-21 19:02 ` Jonathan Corbet
  2019-02-21 19:02 ` [PATCH 2/2] docs: Add struct file refcounting and SCM_RIGHTS mess info Jonathan Corbet
  2019-02-21 19:06 ` [PATCH 0/2] docs: Add some struct file refcounting information Jonathan Corbet
  2 siblings, 0 replies; 5+ messages in thread
From: Jonathan Corbet @ 2019-02-21 19:02 UTC
  To: linux-doc; +Cc: linux-kernel, Al Viro, axboe, Jonathan Corbet

Documentation/filesystems is, like much of the rest of the kernel's
documentation, a jumble of unorganized information.  Split the
documentation into categories and try to bring some order to the top-level
index.rst files.  No text changes other than a few section-introductory
blurbs; this is all just moving stuff around.

Signed-off-by: Jonathan Corbet <corbet@lwn.net>
---
 Documentation/filesystems/api-summary.rst | 150 ++++++++
 Documentation/filesystems/index.rst       | 394 ++--------------------
 Documentation/filesystems/journalling.rst | 184 ++++++++++
 Documentation/filesystems/path-lookup.rst |  15 +
 Documentation/filesystems/splice.rst      |  22 ++
 5 files changed, 395 insertions(+), 370 deletions(-)
 create mode 100644 Documentation/filesystems/api-summary.rst
 create mode 100644 Documentation/filesystems/journalling.rst
 create mode 100644 Documentation/filesystems/splice.rst

diff --git a/Documentation/filesystems/api-summary.rst b/Documentation/filesystems/api-summary.rst
new file mode 100644
index 000000000000..aa51ffcfa029
--- /dev/null
+++ b/Documentation/filesystems/api-summary.rst
@@ -0,0 +1,150 @@
+=============================
+Linux Filesystems API summary
+=============================
+
+This section contains API-level documentation, mostly taken from the source
+code itself.
+
+The Linux VFS
+=============
+
+The Filesystem types
+--------------------
+
+.. kernel-doc:: include/linux/fs.h
+   :internal:
+
+The Directory Cache
+-------------------
+
+.. kernel-doc:: fs/dcache.c
+   :export:
+
+.. kernel-doc:: include/linux/dcache.h
+   :internal:
+
+Inode Handling
+--------------
+
+.. kernel-doc:: fs/inode.c
+   :export:
+
+.. kernel-doc:: fs/bad_inode.c
+   :export:
+
+Registration and Superblocks
+----------------------------
+
+.. kernel-doc:: fs/super.c
+   :export:
+
+File Locks
+----------
+
+.. kernel-doc:: fs/locks.c
+   :export:
+
+.. kernel-doc:: fs/locks.c
+   :internal:
+
+Other Functions
+---------------
+
+.. kernel-doc:: fs/mpage.c
+   :export:
+
+.. kernel-doc:: fs/namei.c
+   :export:
+
+.. kernel-doc:: fs/buffer.c
+   :export:
+
+.. kernel-doc:: block/bio.c
+   :export:
+
+.. kernel-doc:: fs/seq_file.c
+   :export:
+
+.. kernel-doc:: fs/filesystems.c
+   :export:
+
+.. kernel-doc:: fs/fs-writeback.c
+   :export:
+
+.. kernel-doc:: fs/block_dev.c
+   :export:
+
+.. kernel-doc:: fs/anon_inodes.c
+   :export:
+
+.. kernel-doc:: fs/attr.c
+   :export:
+
+.. kernel-doc:: fs/d_path.c
+   :export:
+
+.. kernel-doc:: fs/dax.c
+   :export:
+
+.. kernel-doc:: fs/direct-io.c
+   :export:
+
+.. kernel-doc:: fs/file_table.c
+   :export:
+
+.. kernel-doc:: fs/libfs.c
+   :export:
+
+.. kernel-doc:: fs/posix_acl.c
+   :export:
+
+.. kernel-doc:: fs/stat.c
+   :export:
+
+.. kernel-doc:: fs/sync.c
+   :export:
+
+.. kernel-doc:: fs/xattr.c
+   :export:
+
+The proc filesystem
+===================
+
+sysctl interface
+----------------
+
+.. kernel-doc:: kernel/sysctl.c
+   :export:
+
+proc filesystem interface
+-------------------------
+
+.. kernel-doc:: fs/proc/base.c
+   :internal:
+
+Events based on file descriptors
+================================
+
+.. kernel-doc:: fs/eventfd.c
+   :export:
+
+The Filesystem for Exporting Kernel Objects
+===========================================
+
+.. kernel-doc:: fs/sysfs/file.c
+   :export:
+
+.. kernel-doc:: fs/sysfs/symlink.c
+   :export:
+
+The debugfs filesystem
+======================
+
+debugfs interface
+-----------------
+
+.. kernel-doc:: fs/debugfs/inode.c
+   :export:
+
+.. kernel-doc:: fs/debugfs/file.c
+   :export:
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index 61d2441b25d5..1131c34d77f6 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -1,389 +1,43 @@
-=====================
-Linux Filesystems API
-=====================
+===============================
+Filesystems in the Linux kernel
+===============================
 
-The Linux VFS
-=============
+This under-development manual will, some glorious day, provide
+comprehensive information on how the Linux virtual filesystem (VFS) layer
+works, along with the filesystems that sit below it.  For now, what we have
+can be found below.
 
-The Filesystem types
---------------------
-
-.. kernel-doc:: include/linux/fs.h
-   :internal:
-
-The Directory Cache
--------------------
-
-.. kernel-doc:: fs/dcache.c
-   :export:
-
-.. kernel-doc:: include/linux/dcache.h
-   :internal:
-
-Inode Handling
---------------
-
-.. kernel-doc:: fs/inode.c
-   :export:
-
-.. kernel-doc:: fs/bad_inode.c
-   :export:
-
-Registration and Superblocks
-----------------------------
-
-.. kernel-doc:: fs/super.c
-   :export:
-
-File Locks
-----------
-
-.. kernel-doc:: fs/locks.c
-   :export:
-
-.. kernel-doc:: fs/locks.c
-   :internal:
-
-Other Functions
----------------
-
-.. kernel-doc:: fs/mpage.c
-   :export:
-
-.. kernel-doc:: fs/namei.c
-   :export:
-
-.. kernel-doc:: fs/buffer.c
-   :export:
-
-.. kernel-doc:: block/bio.c
-   :export:
-
-.. kernel-doc:: fs/seq_file.c
-   :export:
-
-.. kernel-doc:: fs/filesystems.c
-   :export:
-
-.. kernel-doc:: fs/fs-writeback.c
-   :export:
-
-.. kernel-doc:: fs/block_dev.c
-   :export:
-
-.. kernel-doc:: fs/anon_inodes.c
-   :export:
-
-.. kernel-doc:: fs/attr.c
-   :export:
-
-.. kernel-doc:: fs/d_path.c
-   :export:
-
-.. kernel-doc:: fs/dax.c
-   :export:
-
-.. kernel-doc:: fs/direct-io.c
-   :export:
-
-.. kernel-doc:: fs/file_table.c
-   :export:
-
-.. kernel-doc:: fs/libfs.c
-   :export:
-
-.. kernel-doc:: fs/posix_acl.c
-   :export:
-
-.. kernel-doc:: fs/stat.c
-   :export:
-
-.. kernel-doc:: fs/sync.c
-   :export:
-
-.. kernel-doc:: fs/xattr.c
-   :export:
-
-The proc filesystem
-===================
-
-sysctl interface
-----------------
-
-.. kernel-doc:: kernel/sysctl.c
-   :export:
-
-proc filesystem interface
--------------------------
-
-.. kernel-doc:: fs/proc/base.c
-   :internal:
-
-Events based on file descriptors
-================================
-
-.. kernel-doc:: fs/eventfd.c
-   :export:
-
-The Filesystem for Exporting Kernel Objects
-===========================================
-
-.. kernel-doc:: fs/sysfs/file.c
-   :export:
-
-.. kernel-doc:: fs/sysfs/symlink.c
-   :export:
-
-The debugfs filesystem
+Core VFS documentation
 ======================
 
-debugfs interface
------------------
+See these manuals for documentation about the VFS layer itself and how its
+algorithms work.
 
-.. kernel-doc:: fs/debugfs/inode.c
-   :export:
+.. toctree::
+   :maxdepth: 2
 
-.. kernel-doc:: fs/debugfs/file.c
-   :export:
+   path-lookup.rst
+   api-summary
+   splice
 
-The Linux Journalling API
+Filesystem support layers
 =========================
 
-Overview
---------
-
-Details
-~~~~~~~
-
-The journalling layer is easy to use. You need to first of all create a
-journal_t data structure. There are two calls to do this dependent on
-how you decide to allocate the physical media on which the journal
-resides. The :c:func:`jbd2_journal_init_inode` call is for journals stored in
-filesystem inodes, or the :c:func:`jbd2_journal_init_dev` call can be used
-for journal stored on a raw device (in a continuous range of blocks). A
-journal_t is a typedef for a struct pointer, so when you are finally
-finished make sure you call :c:func:`jbd2_journal_destroy` on it to free up
-any used kernel memory.
-
-Once you have got your journal_t object you need to 'mount' or load the
-journal file. The journalling layer expects the space for the journal
-was already allocated and initialized properly by the userspace tools.
-When loading the journal you must call :c:func:`jbd2_journal_load` to process
-journal contents. If the client file system detects the journal contents
-does not need to be processed (or even need not have valid contents), it
-may call :c:func:`jbd2_journal_wipe` to clear the journal contents before
-calling :c:func:`jbd2_journal_load`.
-
-Note that jbd2_journal_wipe(..,0) calls
-:c:func:`jbd2_journal_skip_recovery` for you if it detects any outstanding
-transactions in the journal and similarly :c:func:`jbd2_journal_load` will
-call :c:func:`jbd2_journal_recover` if necessary. I would advise reading
-:c:func:`ext4_load_journal` in fs/ext4/super.c for examples on this stage.
-
-Now you can go ahead and start modifying the underlying filesystem.
-Almost.
-
-You still need to actually journal your filesystem changes, this is done
-by wrapping them into transactions. Additionally you also need to wrap
-the modification of each of the buffers with calls to the journal layer,
-so it knows what the modifications you are actually making are. To do
-this use :c:func:`jbd2_journal_start` which returns a transaction handle.
-
-:c:func:`jbd2_journal_start` and its counterpart :c:func:`jbd2_journal_stop`,
-which indicates the end of a transaction are nestable calls, so you can
-reenter a transaction if necessary, but remember you must call
-:c:func:`jbd2_journal_stop` the same number of times as
-:c:func:`jbd2_journal_start` before the transaction is completed (or more
-accurately leaves the update phase). Ext4/VFS makes use of this feature to
-simplify handling of inode dirtying, quota support, etc.
-
-Inside each transaction you need to wrap the modifications to the
-individual buffers (blocks). Before you start to modify a buffer you
-need to call :c:func:`jbd2_journal_get_create_access()` /
-:c:func:`jbd2_journal_get_write_access()` /
-:c:func:`jbd2_journal_get_undo_access()` as appropriate, this allows the
-journalling layer to copy the unmodified
-data if it needs to. After all the buffer may be part of a previously
-uncommitted transaction. At this point you are at last ready to modify a
-buffer, and once you are have done so you need to call
-:c:func:`jbd2_journal_dirty_metadata`. Or if you've asked for access to a
-buffer you now know is now longer required to be pushed back on the
-device you can call :c:func:`jbd2_journal_forget` in much the same way as you
-might have used :c:func:`bforget` in the past.
-
-A :c:func:`jbd2_journal_flush` may be called at any time to commit and
-checkpoint all your transactions.
-
-Then at umount time , in your :c:func:`put_super` you can then call
-:c:func:`jbd2_journal_destroy` to clean up your in-core journal object.
-
-Unfortunately there a couple of ways the journal layer can cause a
-deadlock. The first thing to note is that each task can only have a
-single outstanding transaction at any one time, remember nothing commits
-until the outermost :c:func:`jbd2_journal_stop`. This means you must complete
-the transaction at the end of each file/inode/address etc. operation you
-perform, so that the journalling system isn't re-entered on another
-journal. Since transactions can't be nested/batched across differing
-journals, and another filesystem other than yours (say ext4) may be
-modified in a later syscall.
-
-The second case to bear in mind is that :c:func:`jbd2_journal_start` can block
-if there isn't enough space in the journal for your transaction (based
-on the passed nblocks param) - when it blocks it merely(!) needs to wait
-for transactions to complete and be committed from other tasks, so
-essentially we are waiting for :c:func:`jbd2_journal_stop`. So to avoid
-deadlocks you must treat :c:func:`jbd2_journal_start` /
-:c:func:`jbd2_journal_stop` as if they were semaphores and include them in
-your semaphore ordering rules to prevent
-deadlocks. Note that :c:func:`jbd2_journal_extend` has similar blocking
-behaviour to :c:func:`jbd2_journal_start` so you can deadlock here just as
-easily as on :c:func:`jbd2_journal_start`.
-
-Try to reserve the right number of blocks the first time. ;-). This will
-be the maximum number of blocks you are going to touch in this
-transaction. I advise having a look at at least ext4_jbd.h to see the
-basis on which ext4 uses to make these decisions.
-
-Another wriggle to watch out for is your on-disk block allocation
-strategy. Why? Because, if you do a delete, you need to ensure you
-haven't reused any of the freed blocks until the transaction freeing
-these blocks commits. If you reused these blocks and crash happens,
-there is no way to restore the contents of the reallocated blocks at the
-end of the last fully committed transaction. One simple way of doing
-this is to mark blocks as free in internal in-memory block allocation
-structures only after the transaction freeing them commits. Ext4 uses
-journal commit callback for this purpose.
-
-With journal commit callbacks you can ask the journalling layer to call
-a callback function when the transaction is finally committed to disk,
-so that you can do some of your own management. You ask the journalling
-layer for calling the callback by simply setting
-``journal->j_commit_callback`` function pointer and that function is
-called after each transaction commit. You can also use
-``transaction->t_private_list`` for attaching entries to a transaction
-that need processing when the transaction commits.
-
-JBD2 also provides a way to block all transaction updates via
-:c:func:`jbd2_journal_lock_updates()` /
-:c:func:`jbd2_journal_unlock_updates()`. Ext4 uses this when it wants a
-window with a clean and stable fs for a moment. E.g.
-
-::
-
-
-        jbd2_journal_lock_updates() //stop new stuff happening..
-        jbd2_journal_flush()        // checkpoint everything.
-        ..do stuff on stable fs
-        jbd2_journal_unlock_updates() // carry on with filesystem use.
-
-The opportunities for abuse and DOS attacks with this should be obvious,
-if you allow unprivileged userspace to trigger codepaths containing
-these calls.
-
-Summary
-~~~~~~~
-
-Using the journal is a matter of wrapping the different context changes,
-being each mount, each modification (transaction) and each changed
-buffer to tell the journalling layer about them.
-
-Data Types
-----------
-
-The journalling layer uses typedefs to 'hide' the concrete definitions
-of the structures used. As a client of the JBD2 layer you can just rely
-on the using the pointer as a magic cookie of some sort. Obviously the
-hiding is not enforced as this is 'C'.
-
-Structures
-~~~~~~~~~~
-
-.. kernel-doc:: include/linux/jbd2.h
-   :internal:
-
-Functions
----------
-
-The functions here are split into two groups those that affect a journal
-as a whole, and those which are used to manage transactions
-
-Journal Level
-~~~~~~~~~~~~~
-
-.. kernel-doc:: fs/jbd2/journal.c
-   :export:
-
-.. kernel-doc:: fs/jbd2/recovery.c
-   :internal:
-
-Transasction Level
-~~~~~~~~~~~~~~~~~~
-
-.. kernel-doc:: fs/jbd2/transaction.c
-
-See also
---------
-
-`Journaling the Linux ext2fs Filesystem, LinuxExpo 98, Stephen
-Tweedie <http://kernel.org/pub/linux/kernel/people/sct/ext3/journal-design.ps.gz>`__
-
-`Ext3 Journalling FileSystem, OLS 2000, Dr. Stephen
-Tweedie <http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html>`__
-
-splice API
-==========
-
-splice is a method for moving blocks of data around inside the kernel,
-without continually transferring them between the kernel and user space.
-
-.. kernel-doc:: fs/splice.c
-
-pipes API
-=========
-
-Pipe interfaces are all for in-kernel (builtin image) use. They are not
-exported for use by modules.
-
-.. kernel-doc:: include/linux/pipe_fs_i.h
-   :internal:
-
-.. kernel-doc:: fs/pipe.c
-
-Encryption API
-==============
-
-A library which filesystems can hook into to support transparent
-encryption of files and directories.
+Documentation for the support code within the filesystem layer for use in
+filesystem implementations.
 
 .. toctree::
-    :maxdepth: 2
-
-    fscrypt
-
-Pathname lookup
-===============
-
-
-This write-up is based on three articles published at lwn.net:
+   :maxdepth: 2
 
-- <https://lwn.net/Articles/649115/> Pathname lookup in Linux
-- <https://lwn.net/Articles/649729/> RCU-walk: faster pathname lookup in Linux
-- <https://lwn.net/Articles/650786/> A walk among the symlinks
+   journalling
+   fscrypt
 
-Written by Neil Brown with help from Al Viro and Jon Corbet.
-It has subsequently been updated to reflect changes in the kernel
-including:
+Filesystem-specific documentation
+=================================
 
-- per-directory parallel name lookup.
+Documentation for individual filesystem types can be found here.
 
 .. toctree::
    :maxdepth: 2
 
-   path-lookup.rst
-
-binderfs
-========
-
-.. toctree::
-
    binderfs.rst
diff --git a/Documentation/filesystems/journalling.rst b/Documentation/filesystems/journalling.rst
new file mode 100644
index 000000000000..58ce6b395206
--- /dev/null
+++ b/Documentation/filesystems/journalling.rst
@@ -0,0 +1,184 @@
+The Linux Journalling API
+=========================
+
+Overview
+--------
+
+Details
+~~~~~~~
+
+The journalling layer is easy to use. You need to first of all create a
+journal_t data structure. There are two calls to do this dependent on
+how you decide to allocate the physical media on which the journal
+resides. The :c:func:`jbd2_journal_init_inode` call is for journals stored in
+filesystem inodes, or the :c:func:`jbd2_journal_init_dev` call can be used
+for journal stored on a raw device (in a continuous range of blocks). A
+journal_t is a typedef for a struct pointer, so when you are finally
+finished make sure you call :c:func:`jbd2_journal_destroy` on it to free up
+any used kernel memory.
+
+Once you have got your journal_t object you need to 'mount' or load the
+journal file. The journalling layer expects the space for the journal
+was already allocated and initialized properly by the userspace tools.
+When loading the journal you must call :c:func:`jbd2_journal_load` to process
+journal contents. If the client file system detects the journal contents
+does not need to be processed (or even need not have valid contents), it
+may call :c:func:`jbd2_journal_wipe` to clear the journal contents before
+calling :c:func:`jbd2_journal_load`.
+
+Note that jbd2_journal_wipe(..,0) calls
+:c:func:`jbd2_journal_skip_recovery` for you if it detects any outstanding
+transactions in the journal and similarly :c:func:`jbd2_journal_load` will
+call :c:func:`jbd2_journal_recover` if necessary. I would advise reading
+:c:func:`ext4_load_journal` in fs/ext4/super.c for examples on this stage.
+
+Now you can go ahead and start modifying the underlying filesystem.
+Almost.
+
+You still need to actually journal your filesystem changes, this is done
+by wrapping them into transactions. Additionally you also need to wrap
+the modification of each of the buffers with calls to the journal layer,
+so it knows what the modifications you are actually making are. To do
+this use :c:func:`jbd2_journal_start` which returns a transaction handle.
+
+:c:func:`jbd2_journal_start` and its counterpart :c:func:`jbd2_journal_stop`,
+which indicates the end of a transaction are nestable calls, so you can
+reenter a transaction if necessary, but remember you must call
+:c:func:`jbd2_journal_stop` the same number of times as
+:c:func:`jbd2_journal_start` before the transaction is completed (or more
+accurately leaves the update phase). Ext4/VFS makes use of this feature to
+simplify handling of inode dirtying, quota support, etc.
+
+Inside each transaction you need to wrap the modifications to the
+individual buffers (blocks). Before you start to modify a buffer you
+need to call :c:func:`jbd2_journal_get_create_access()` /
+:c:func:`jbd2_journal_get_write_access()` /
+:c:func:`jbd2_journal_get_undo_access()` as appropriate, this allows the
+journalling layer to copy the unmodified
+data if it needs to. After all the buffer may be part of a previously
+uncommitted transaction. At this point you are at last ready to modify a
+buffer, and once you are have done so you need to call
+:c:func:`jbd2_journal_dirty_metadata`. Or if you've asked for access to a
+buffer you now know is now longer required to be pushed back on the
+device you can call :c:func:`jbd2_journal_forget` in much the same way as you
+might have used :c:func:`bforget` in the past.
+
+A :c:func:`jbd2_journal_flush` may be called at any time to commit and
+checkpoint all your transactions.
+
+Then at umount time , in your :c:func:`put_super` you can then call
+:c:func:`jbd2_journal_destroy` to clean up your in-core journal object.
+
+Unfortunately there a couple of ways the journal layer can cause a
+deadlock. The first thing to note is that each task can only have a
+single outstanding transaction at any one time, remember nothing commits
+until the outermost :c:func:`jbd2_journal_stop`. This means you must complete
+the transaction at the end of each file/inode/address etc. operation you
+perform, so that the journalling system isn't re-entered on another
+journal. Since transactions can't be nested/batched across differing
+journals, and another filesystem other than yours (say ext4) may be
+modified in a later syscall.
+
+The second case to bear in mind is that :c:func:`jbd2_journal_start` can block
+if there isn't enough space in the journal for your transaction (based
+on the passed nblocks param) - when it blocks it merely(!) needs to wait
+for transactions to complete and be committed from other tasks, so
+essentially we are waiting for :c:func:`jbd2_journal_stop`. So to avoid
+deadlocks you must treat :c:func:`jbd2_journal_start` /
+:c:func:`jbd2_journal_stop` as if they were semaphores and include them in
+your semaphore ordering rules to prevent
+deadlocks. Note that :c:func:`jbd2_journal_extend` has similar blocking
+behaviour to :c:func:`jbd2_journal_start` so you can deadlock here just as
+easily as on :c:func:`jbd2_journal_start`.
+
+Try to reserve the right number of blocks the first time. ;-). This will
+be the maximum number of blocks you are going to touch in this
+transaction. I advise having a look at at least ext4_jbd.h to see the
+basis on which ext4 uses to make these decisions.
+
+Another wriggle to watch out for is your on-disk block allocation
+strategy. Why? Because, if you do a delete, you need to ensure you
+haven't reused any of the freed blocks until the transaction freeing
+these blocks commits. If you reused these blocks and crash happens,
+there is no way to restore the contents of the reallocated blocks at the
+end of the last fully committed transaction. One simple way of doing
+this is to mark blocks as free in internal in-memory block allocation
+structures only after the transaction freeing them commits. Ext4 uses
+journal commit callback for this purpose.
+
+With journal commit callbacks you can ask the journalling layer to call
+a callback function when the transaction is finally committed to disk,
+so that you can do some of your own management. You ask the journalling
+layer for calling the callback by simply setting
+``journal->j_commit_callback`` function pointer and that function is
+called after each transaction commit. You can also use
+``transaction->t_private_list`` for attaching entries to a transaction
+that need processing when the transaction commits.
+
+JBD2 also provides a way to block all transaction updates via
+:c:func:`jbd2_journal_lock_updates()` /
+:c:func:`jbd2_journal_unlock_updates()`. Ext4 uses this when it wants a
+window with a clean and stable fs for a moment. E.g.
+
+::
+
+
+        jbd2_journal_lock_updates() //stop new stuff happening..
+        jbd2_journal_flush()        // checkpoint everything.
+        ..do stuff on stable fs
+        jbd2_journal_unlock_updates() // carry on with filesystem use.
+
+The opportunities for abuse and DOS attacks with this should be obvious,
+if you allow unprivileged userspace to trigger codepaths containing
+these calls.
+
+Summary
+~~~~~~~
+
+Using the journal is a matter of wrapping the different context changes,
+being each mount, each modification (transaction) and each changed
+buffer to tell the journalling layer about them.
+
+Data Types
+----------
+
+The journalling layer uses typedefs to 'hide' the concrete definitions
+of the structures used. As a client of the JBD2 layer you can just rely
+on the using the pointer as a magic cookie of some sort. Obviously the
+hiding is not enforced as this is 'C'.
+
+Structures
+~~~~~~~~~~
+
+.. kernel-doc:: include/linux/jbd2.h
+   :internal:
+
+Functions
+---------
+
+The functions here are split into two groups those that affect a journal
+as a whole, and those which are used to manage transactions
+
+Journal Level
+~~~~~~~~~~~~~
+
+.. kernel-doc:: fs/jbd2/journal.c
+   :export:
+
+.. kernel-doc:: fs/jbd2/recovery.c
+   :internal:
+
+Transasction Level
+~~~~~~~~~~~~~~~~~~
+
+.. kernel-doc:: fs/jbd2/transaction.c
+
+See also
+--------
+
+`Journaling the Linux ext2fs Filesystem, LinuxExpo 98, Stephen
+Tweedie <http://kernel.org/pub/linux/kernel/people/sct/ext3/journal-design.ps.gz>`__
+
+`Ext3 Journalling FileSystem, OLS 2000, Dr. Stephen
+Tweedie <http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html>`__
+
diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst
index 80e22eda4132..434a07b0002b 100644
--- a/Documentation/filesystems/path-lookup.rst
+++ b/Documentation/filesystems/path-lookup.rst
@@ -1,3 +1,18 @@
+===============
+Pathname lookup
+===============
+
+This write-up is based on three articles published at lwn.net:
+
+- <https://lwn.net/Articles/649115/> Pathname lookup in Linux
+- <https://lwn.net/Articles/649729/> RCU-walk: faster pathname lookup in Linux
+- <https://lwn.net/Articles/650786/> A walk among the symlinks
+
+Written by Neil Brown with help from Al Viro and Jon Corbet.
+It has subsequently been updated to reflect changes in the kernel
+including:
+
+- per-directory parallel name lookup.
 
 Introduction to pathname lookup
 ===============================
diff --git a/Documentation/filesystems/splice.rst b/Documentation/filesystems/splice.rst
new file mode 100644
index 000000000000..edd874808472
--- /dev/null
+++ b/Documentation/filesystems/splice.rst
@@ -0,0 +1,22 @@
+================
+splice and pipes
+================
+
+splice API
+==========
+
+splice is a method for moving blocks of data around inside the kernel,
+without continually transferring them between the kernel and user space.
+
+.. kernel-doc:: fs/splice.c
+
+pipes API
+=========
+
+Pipe interfaces are all for in-kernel (builtin image) use. They are not
+exported for use by modules.
+
+.. kernel-doc:: include/linux/pipe_fs_i.h
+   :internal:
+
+.. kernel-doc:: fs/pipe.c
-- 
2.20.1



* [PATCH 2/2] docs: Add struct file refcounting and SCM_RIGHTS mess info
  2019-02-21 19:02 [PATCH 0/2] docs: Add some struct file refcounting information Jonathan Corbet
  2019-02-21 19:02 ` [PATCH 1/2] docs: Bring some order to filesystem documentation Jonathan Corbet
@ 2019-02-21 19:02 ` Jonathan Corbet
  2019-02-21 19:06 ` [PATCH 0/2] docs: Add some struct file refcounting information Jonathan Corbet
  2 siblings, 0 replies; 5+ messages in thread
From: Jonathan Corbet @ 2019-02-21 19:02 UTC
  To: linux-doc; +Cc: linux-kernel, Al Viro, axboe, Jonathan Corbet

Work up some text posted by Al and add it to the filesystem manual.

Co-developed-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
---
 Documentation/filesystems/index.rst      |   1 +
 Documentation/filesystems/lifecycles.rst | 357 +++++++++++++++++++++++
 2 files changed, 358 insertions(+)
 create mode 100644 Documentation/filesystems/lifecycles.rst

diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index 1131c34d77f6..44ff355e0be6 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -16,6 +16,7 @@ algorithms work.
 .. toctree::
    :maxdepth: 2
 
+   lifecycles
    path-lookup.rst
    api-summary
    splice
diff --git a/Documentation/filesystems/lifecycles.rst b/Documentation/filesystems/lifecycles.rst
new file mode 100644
index 000000000000..b30f566cfe0d
--- /dev/null
+++ b/Documentation/filesystems/lifecycles.rst
@@ -0,0 +1,357 @@
+======================
+Lifecycles and locking
+======================
+
+This manual aspires to cover the lifecycles of VFS objects and the locking
+that protects them.
+
+Reference counting for file structures
+======================================
+
+(The following text derives from `this email from Al Viro
+<https://lwn.net/ml/linux-fsdevel/20190207040058.GW2217@ZenIV.linux.org.uk/>`_).
+
+The :c:type:`struct file` type represents an open file in the kernel.  Its
+lifetime is controlled by a simple reference count (f_count) in that
+structure.  References are obtained with functions like fget(), fdget(),
+and fget_raw(); they are returned with fput().
+
+.. FIXME we should have kerneldoc comments for those functions
+
+The struct file destructor (__fput() and the filesystem-specific
+->release() function called from it) is called once the counter hits zero.
+Each file descriptor counts as a reference.  Thus, dup() will increment
+the refcount by 1, close() will decrement it, fork() will increment it
+by the number of descriptors in your descriptor table referring to this
+struct file, destruction of the descriptor table on exit() will decrement
+by the same amount, etc.
+
+Syscalls like read() and friends turn descriptors into struct file
+references.  If the descriptor table is shared, that counts as a new
+reference that must be dropped at the end of the syscall; otherwise we are
+guaranteed that the reference in the descriptor table will stay around
+until the end of the syscall, so we may use it without bumping the file
+refcount.  That's the difference between fget() and fdget() - the former
+will bump the refcount, while the latter will try to avoid that.  Of
+course, if we do not intend to drop the reference we'd acquired by the end
+of the syscall, we want fget(); fdget() is for transient references only.
+
+Descriptor tables
+-----------------
+
+Descriptor tables (:c:type:`struct files_struct`) *can* be shared; several
+processes (usually threads that share address spaces as well, but that's
+not necessary) may be working with the same set of struct files so, for
+example, an open() call in one of them is seen by the others.  The same
+goes for close(), dup(), dup2(), etc.
+
+That makes for an interesting corner case: what if two threads happen to
+share a descriptor table, and one of them closes a file descriptor while
+another is in the middle of a read() call on that same descriptor?  That's
+one area where Unices differ; one variant is to abort the read() call,
+another would have close() wait for the read() call to finish, etc.  What
+we do is:
+
+  * close() succeeds immediately; the reference is removed from
+    the descriptor table and dropped.
+
+  * If the close() call happens before read(fd, ...) has converted the file
+    descriptor to a struct file reference, read() will fail with -EBADF.
+
+  * Otherwise, read() proceeds unmolested.  The reference it has acquired
+    is dropped at the end of the syscall.  If that's the last reference to
+    the file, the file structure will get shut down at that point.
+
+A call to clone() will result in the child sharing the parent's descriptor
+table if CLONE_FILES is in the flags.  Note that, in this case, struct file
+refcounts are not modified at all, since no new references to files are
+created.  Without CLONE_FILES, it's the same as fork(): an independent copy
+of the descriptor table is created and populated by copies of references to
+files, each bumping the file's refcount.
+
+Calling unshare() with CLONE_FILES in the flags will create a copy of the
+descriptor table (same as done on fork(), etc.) and switch to using it; the
+old reference will be dropped (note: it'll only bother with that if the
+descriptor table used to be shared in the first place; if we hold the only
+reference to the descriptor table, we'll just keep using it).
+
+execve() does almost the same thing: if the pre-exec descriptor table is
+shared, it will switch to a new copy first.  In case of success the
+reference to the original table is dropped, in case of failure we revert to
+the original and drop the copy.  Note that handling of close-on-exec is
+done in the *copy*; the original is unaffected, so failing in execve() does
+not disrupt the descriptor table.
+
+exit() will drop the reference to the descriptor table.  When the last
+reference is dropped, all file references are removed from it (and dropped).
+
+The thread's pointer to its descriptor table (current->files) is never
+modified by other threads; something like::
+
+  ls /proc/<pid>/fd
+
+will fetch it, so stores need to be protected (by task_lock(current)), but
+only the thread itself can do them.
+
+Note that, while extra references to the descriptor table can appear at any
+time (/proc/<pid>/fd accesses, for example), such references may not be
+used for modifications.  In particular, you can't switch to another
+thread's descriptor table, unless it had been yours at some earlier point
+*and* you've kept a reference to it.
+
+That's about it for descriptor tables; that, by far, is the main source of
+persistently held struct file references.  Transient references are grabbed
+by syscalls when they resolve a descriptor to a struct file pointer, which
+ought to be done once per syscall *and* reasonably early in it.
+Unfortunately, that's not all; there are other persistent struct file
+references.
+
+Other persistent references
+---------------------------
+
+A key point so far is that references to file structures are not held
+(directly or indirectly) in other file structures.  If that were
+universally true, life would be simpler, since we would never have to worry
+about reference-count loops.  Unfortunately, there are some more
+complicated cases that the kernel has to worry about.
+
+Some things, such as the case of a LOOP_SET_FD ioctl() call grabbing a
+reference to a file structure and stashing it in the lo_backing_file field
+of a loop_device structure, are reasonably simple.  The struct file
+reference will be dropped later, either directly by a LOOP_CLR_FD operation
+(if nothing else holds the thing open at the time) or later in
+lo_release().
+
+Note that, in the latter case, things can get a bit more complicated.  A
+process closing /dev/loop might drop the last reference to it, triggering a
+call to bdput() that releases the last reference holding a block device
+open.  That will trigger a call to lo_release(), which will drop the
+reference on the underlying file structure, which is almost certainly the
+last one at that point.  This case is still not a problem; while we do have
+the underlying struct file pinned by something held by another struct file,
+the dependency graph is acyclic, so the plain refcounts we are using work
+fine.
+
+The same goes for the things like e.g. ecryptfs opening an underlying
+(encrypted) file on open() and dropping it when the last reference to
+ecryptfs file is dropped; the only difference here is that the underlying
+struct file never appears in anyone's descriptor tables.
+
+However, in a couple of cases we do have something trickier.
+
+File references and SCM_RIGHTS
+------------------------------
+
+The SCM_RIGHTS datagram option with Unix-domain sockets can be used to
+transfer a file descriptor, and its associated struct file reference, to
+the receiving process.  That brings about a couple of situations where
+things can go wrong.
+
+Case 1: an SCM_RIGHTS datagram can be sent to an AF_UNIX socket.  That
+converts the caller-supplied array of descriptors into an array of struct
+file references, which gets attached to the packet we queue.  When the
+datagram is received, the struct file references are moved into the
+descriptor table of the recipient or, in case of error, dropped.  Note that
+sending some descriptors in an SCM_RIGHTS datagram and closing them
+immediately is perfectly legitimate: as soon as sendmsg() returns you can
+go ahead and close the descriptors you've sent.  The references for the
+recipient are already acquired, so you don't need to wait for the packet to
+be received.
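+
+The send-then-close pattern can be sketched from userspace as follows (an
+illustrative example, not taken from the original text; the CHECK() helper
+and the choice of a pipe as the passed file are assumptions).  The read end
+of a pipe is sent over a socketpair and closed immediately; the queued
+datagram keeps the struct file alive until it is received:
+
+.. code-block:: c
+
+   #include <stdio.h>
+   #include <stdlib.h>
+   #include <string.h>
+   #include <sys/socket.h>
+   #include <sys/uio.h>
+   #include <unistd.h>
+
+   /* Hypothetical helper: bail out if a step of the demo fails. */
+   #define CHECK(cond) \
+           do { if (!(cond)) { perror(#cond); exit(1); } } while (0)
+
+   int main(void)
+   {
+           int sv[2], p[2], got;
+           char byte = 'x', buf[4];
+           union {
+                   char buf[CMSG_SPACE(sizeof(int))];
+                   struct cmsghdr align;
+           } ctl;
+           struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
+           struct msghdr msg;
+           struct cmsghdr *cmsg;
+
+           CHECK(socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) == 0);
+           CHECK(pipe(p) == 0);
+
+           memset(&msg, 0, sizeof(msg));
+           memset(&ctl, 0, sizeof(ctl));
+           msg.msg_iov = &iov;
+           msg.msg_iovlen = 1;
+           msg.msg_control = ctl.buf;
+           msg.msg_controllen = sizeof(ctl.buf);
+
+           /* Attach the pipe's read end as SCM_RIGHTS ancillary data. */
+           cmsg = CMSG_FIRSTHDR(&msg);
+           cmsg->cmsg_level = SOL_SOCKET;
+           cmsg->cmsg_type = SCM_RIGHTS;
+           cmsg->cmsg_len = CMSG_LEN(sizeof(int));
+           memcpy(CMSG_DATA(cmsg), &p[0], sizeof(int));
+
+           CHECK(sendmsg(sv[0], &msg, 0) == 1);
+           CHECK(close(p[0]) == 0);  /* the datagram holds its own reference */
+
+           /* The read end still exists, so writing does not fail. */
+           CHECK(write(p[1], "hi", 2) == 2);
+
+           msg.msg_controllen = sizeof(ctl.buf);
+           CHECK(recvmsg(sv[1], &msg, 0) == 1);
+           cmsg = CMSG_FIRSTHDR(&msg);
+           CHECK(cmsg && cmsg->cmsg_level == SOL_SOCKET &&
+                 cmsg->cmsg_type == SCM_RIGHTS);
+           memcpy(&got, CMSG_DATA(cmsg), sizeof(int));
+
+           /* The received descriptor refers to the same pipe. */
+           CHECK(read(got, buf, sizeof(buf)) == 2);
+           return 0;
+   }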
+
+That would still be simple, if not for the fact that there's nothing to
+stop you from passing AF_UNIX sockets themselves around in the same way.
+In fact, that has legitimate uses and, most of the time, doesn't cause any
+complications at all.  However, it is possible to end up in the following
+situation:
+
+  * struct file instances A and B are both AF_UNIX sockets.
+  * The only reference to A is in the SCM_RIGHTS packet that sits in the
+    receiving queue of B.
+  * The only reference to B is in the SCM_RIGHTS packet that sits in the
+    receiving queue of A.
+
+That, of course, is where pure refcounting of any kind will break.
+
+The SCM_RIGHTS datagram that contains the sole reference to A can't be
+received without the recipient getting hold of a reference to B.  That
+cannot happen until somebody manages to receive the SCM_RIGHTS datagram
+containing the sole reference to B.  But that cannot happen until that
+somebody manages to get hold of a reference to A, which cannot happen until
+the first SCM_RIGHTS datagram is received.
+
+Dropping the last reference to A would have discarded everything in its
+receiving queue, including the SCM_RIGHTS datagram that contains the
+reference to B; however, that can't happen either; the other SCM_RIGHTS
+datagram would have to be either received or discarded first, etc.
+
+Case 2: similar, with a bit of a twist.  An AF_UNIX socket used for
+descriptor passing is normally set up by socket(), followed by connect().
+As soon as connect() returns, one can start sending.  Note that connect()
+does *NOT* wait for the recipient to call accept(); it creates the object
+that will serve as the low-level part of the other end of connection
+(complete with received packet queue) and stashes that object into the
+queue of the *listener's* socket.  A subsequent accept() call fetches it
+from there and attaches it to a new socket, completing the setup; in the
+meanwhile, sending packets works fine.  Once accept() is done, it'll see
+the stuff you'd sent already in the queue of the new socket and everything
+works fine.
+
+If the listening socket gets closed without accept() having been called,
+its queue is flushed, discarding all pending connection attempts, complete
+with *their* queues.  Which is the same effect as accept() + close(), so
+again, normally everything just works.  However, consider the case when we
+have:
+
+  * struct file instances A and B being AF_UNIX sockets.
+  * A is a listener
+  * B is an established connection, with the other end yet to be accepted
+    on A 
+  * The only references to A and B are in an SCM_RIGHTS datagram sent over
+    to A by B.
+
+That SCM_RIGHTS datagram could have been received if somebody had managed
+to call accept() on A and recvmsg() on the socket created by that accept()
+call.  But that can't happen without that somebody getting hold of a
+reference to A in the first place, which can't happen without having
+received that SCM_RIGHTS datagram.  It can't be discarded either, since
+that can't happen without dropping the last reference to A, which sits
+right in it.
+
+The difference from the previous case is that there we had:
+
+  * A holds unix_sock of A
+  * unix_sock of A holds SCM_RIGHTS with reference to B
+  * B holds unix_sock of B
+  * unix_sock of B holds SCM_RIGHTS with reference to A
+
+and here we have:
+
+  * A holds unix_sock of A
+  * unix_sock of A holds the packet with reference to embryonic unix_sock
+    created by connect()
+  * that embryonic unix_sock holds SCM_RIGHTS with references to A and B.
+
+The dependency graph is different, but the problem is the same; there are
+unreachable loops in it.  Note that neither class of situations
+would occur normally; in the best case it's "somebody had been
+doing rather convoluted descriptor passing, but everyone involved
+got hit with kill -9 at the wrong time; please, make sure nothing
+leaks".  That can happen, but a userland race (e.g. botched protocol
+handling of some sort) or a deliberate abuse are much more likely.
+
+Catching the loop creation is hard and paying for that every time we do
+descriptor-passing would be a bad idea.  Besides, the loop per se is not
+fatal; if, for example, in the second case the descriptor for A had been
+kept around, close(accept()) would've cleaned everything up.  Which means
+that we need a garbage collector to deal with the (rare) leaks.
+
+Note that, in both cases, the leaks are caused by loops passing through
+some SCM_RIGHTS datagrams that can never be received.  So locating those,
+removing them from the queues they sit in and then discarding the suckers,
+is enough to resolve the situation. Furthermore, in both cases the loop
+passes through the unix_sock of something that got sent over in an
+SCM_RIGHTS datagram.  So we can do the following:
+
+  1) Keep the count of references to file structures of AF_UNIX sockets
+     held by SCM_RIGHTS; this value is kept in unix_sock->inflight.  Any
+     struct unix_sock instance without such references is not a part of
+     an unreachable loop.  Maintain the set of unix_sock that are not excluded
+     by that (i.e. the ones that have some references from SCM_RIGHTS
+     instances).  Note that we don't need to maintain those counts in
+     struct file; we care only about unix_sock here.
+
+  2) Any struct file of an AF_UNIX socket with some references *NOT* from
+     SCM_RIGHTS datagrams is also not a part of an unreachable loop.
+
+  3) For each unix_sock, consider the following set of SCM_RIGHTS
+     datagrams: everything in the queue of that unix_sock if it's a
+     non-listener, and everything in queues of *all* embryonic unix_sock
+     structs in the queue of a listener.  Let's call those the SCM_RIGHTS
+     associated with our unix_sock.
+
+  4) All SCM_RIGHTS associated with a reachable unix_sock are themselves
+     reachable.
+
+  5) If some references to the struct file of a unix_sock are in reachable
+     SCM_RIGHTS, that struct file is reachable.
+
+The garbage collector starts with calculating the set of potentially
+unreachable unix_socks:  the ones not excluded by (1, 2).  No unix_sock
+instances outside of that set need to be considered.
+
+If some unix_sock in that set has a counter that is *not* entirely covered
+by SCM_RIGHTS associated with the elements of the set, we can conclude that
+there are references to it in SCM_RIGHTS associated with something outside
+of our set and therefore it is reachable and can be removed from the set.
+
+If that process converges to a non-empty set, we know that everything left
+in that set is unreachable - all references to their struct file come from
+some SCM_RIGHTS datagrams, and all those SCM_RIGHTS datagrams are among
+those that can't be received or discarded without getting hold of a
+reference to struct file of something in our set.
+
+Everything outside of that set is reachable, so taking the SCM_RIGHTS with
+references to stuff in our set (all of them to be found among those
+associated with elements of our set) out of the queues they are in will
+break all unreachable loops.  Discarding the collected datagrams will do
+the rest - the file references in those will be dropped, etc.
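+
+The procedure described above can be modelled with a toy userspace program
+(a sketch under stated assumptions: the matrix representation, the data and
+all names are invented for illustration and do not reflect how unix_gc() is
+actually implemented).  Sockets A and B hold each other's only references;
+C is still held by a descriptor and has a reference to D in its queue:
+
+.. code-block:: c
+
+   #include <stdbool.h>
+   #include <stdio.h>
+
+   #define NSOCK 4
+
+   static const char *name[NSOCK] = { "A", "B", "C", "D" };
+
+   /* Total struct file references to each socket (toy data). */
+   static const int f_count[NSOCK] = { 1, 1, 1, 1 };
+
+   /*
+    * refs[i][j]: references to socket j held by the SCM_RIGHTS
+    * datagrams associated with socket i.
+    */
+   static const int refs[NSOCK][NSOCK] = {
+           { 0, 1, 0, 0 },         /* A's queue references B */
+           { 1, 0, 0, 0 },         /* B's queue references A */
+           { 0, 0, 0, 1 },         /* C's queue references D */
+           { 0, 0, 0, 0 },
+   };
+
+   int main(void)
+   {
+           int inflight[NSOCK] = { 0 };
+           bool cand[NSOCK];
+           bool changed;
+           int i, j;
+
+           for (i = 0; i < NSOCK; i++)
+                   for (j = 0; j < NSOCK; j++)
+                           inflight[j] += refs[i][j];
+
+           /* Rules (1) and (2): in flight, with no other references. */
+           for (j = 0; j < NSOCK; j++)
+                   cand[j] = inflight[j] > 0 && f_count[j] == inflight[j];
+
+           /* Drop candidates referenced from outside the candidate set. */
+           do {
+                   changed = false;
+                   for (j = 0; j < NSOCK; j++) {
+                           int covered = 0;
+
+                           if (!cand[j])
+                                   continue;
+                           for (i = 0; i < NSOCK; i++)
+                                   if (cand[i])
+                                           covered += refs[i][j];
+                           if (covered < inflight[j]) {
+                                   cand[j] = false;
+                                   changed = true;
+                           }
+                   }
+           } while (changed);
+
+           for (j = 0; j < NSOCK; j++)
+                   printf("%s: %s\n", name[j],
+                          cand[j] ? "unreachable" : "reachable");
+           return 0;
+   }
+
+With this data, D starts out as a candidate but is removed because its
+only reference sits in the queue of C, which is outside the set; only A
+and B remain and are reported as unreachable.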
+
+One thing to keep in mind here is the locking.  What the garbage
+collector relies upon is:
+
+  * Changes to ->inflight are serialized with respect to it (on
+    unix_gc_lock; increments are done by unix_inflight(), decrements by
+    unix_notinflight()).
+
+  * Any references extracted from SCM_RIGHTS during the garbage collector
+    run will not be actually used until the end of garbage collection.  For
+    a normal recvmsg() call, this behavior is guaranteed by having
+    unix_notinflight() called between the extraction of scm_fp_list from
+    the packet and doing anything else with the references extracted.  For
+    a MSG_PEEK recvmsg() call, it's actually broken and lacks
+    synchronization; Miklos has proposed to grab and release unix_gc_lock
+    in those, between scm_fp_dup() and doing anything else with the
+    references copied.
+
+.. FIXME: The above should be updated when the fix happens.
+
+  * Adding SCM_RIGHTS in the middle of garbage collection is possible, but
+    in that case it will contain no references to anything in the initial
+    candidate set.
+
+The last one is delicate.  SCM_RIGHTS creation has unix_inflight() called
+for each reference we put there, so it's serialized with respect to
+unix_gc(); however, insertion into the queue is *NOT* covered by that.
+Queue rescans are covered, but each queue has a lock of its own and they
+are definitely not going to be held throughout the whole thing.
+
+So in theory it would be possible to have:
+
+  * thread A: sendmsg() has SCM_RIGHTS created and populated, complete with
+    file refcount and ->inflight increments implied, at which point it gets
+    preempted and loses the timeslice.
+
+  * thread B: gets to run and removes all references from descriptor table
+    it shares with thread A.
+
+  * on another CPU we have the garbage collector triggered; it determines
+    the set of potentially unreachable unix_sock and everything in our
+    SCM_RIGHTS _is_ in that set, now that no other references remain.
+
+  * on the first CPU, thread A regains the timeslice and inserts its
+    SCM_RIGHTS into queue.  And it does contain references to sockets from
+    the candidate set of running garbage collector, confusing the hell out
+    of it.
+
+That is avoided by a convoluted dance around the SCM_RIGHTS creation
+and insertion - we use fget() to obtain struct file references,
+then _duplicate_ them in SCM_RIGHTS (bumping a refcount for each, so
+we are holding *two* references), do unix_inflight() on them, then
+queue the damn thing, then drop each reference we got from fget().
+
+That way everything referred to in that SCM_RIGHTS is going to have
+extra struct file references (and thus be excluded from the initial
+candidate set) until after it gets inserted into queue.  In other
+words, if it does appear in a queue between two passes, it's
+guaranteed to contain no references to anything in the initial
+candidate set.
-- 
2.20.1



* Re: [PATCH 0/2] docs: Add some struct file refcounting information
  2019-02-21 19:02 [PATCH 0/2] docs: Add some struct file refcounting information Jonathan Corbet
  2019-02-21 19:02 ` [PATCH 1/2] docs: Bring some order to filesystem documentation Jonathan Corbet
  2019-02-21 19:02 ` [PATCH 2/2] docs: Add struct file refcounting and SCM_RIGHTS mess info Jonathan Corbet
@ 2019-02-21 19:06 ` Jonathan Corbet
  2 siblings, 0 replies; 5+ messages in thread
From: Jonathan Corbet @ 2019-02-21 19:06 UTC (permalink / raw)
  To: linux-doc; +Cc: linux-kernel, Al Viro, axboe

On Thu, 21 Feb 2019 12:02:28 -0700
Jonathan Corbet <corbet@lwn.net> wrote:

> Al Viro's recent "braindump" on how struct file refcounting works is too
> good to lose, so let's capture it in Documentation/.  There are two changes
> here:
> 
>  - Clean up the filesystem book, move stuff around, and generally try to
>    create some order here.  No text changes other than some connecting
>    text.
> 
>  - Bring in Al's text, with a fair amount of language tweaking and later
>    corrections applied.
> 
> Hopefully this is just the beginning.

I meant to add that, should anybody be curious to see the formatted
version of this work, it's at:

  https://static.lwn.net/kerneldoc/filesystems/index.html

jon


* [PATCH 2/2] docs: Add struct file refcounting and SCM_RIGHTS mess info
  2019-03-04 20:01 [RESEND PATCH " Jonathan Corbet
@ 2019-03-04 20:01 ` Jonathan Corbet
  0 siblings, 0 replies; 5+ messages in thread
From: Jonathan Corbet @ 2019-03-04 20:01 UTC (permalink / raw)
  To: linux-doc; +Cc: linux-kernel, linux-fsdevel, Al Viro, axboe, Jonathan Corbet

Work up some text posted by Al and add it to the filesystem manual.

Co-developed-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
---
 Documentation/filesystems/index.rst      |   1 +
 Documentation/filesystems/lifecycles.rst | 357 +++++++++++++++++++++++
 2 files changed, 358 insertions(+)
 create mode 100644 Documentation/filesystems/lifecycles.rst

diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index 1131c34d77f6..44ff355e0be6 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -16,6 +16,7 @@ algorithms work.
 .. toctree::
    :maxdepth: 2
 
+   lifecycles
    path-lookup.rst
    api-summary
    splice
diff --git a/Documentation/filesystems/lifecycles.rst b/Documentation/filesystems/lifecycles.rst
new file mode 100644
index 000000000000..b30f566cfe0d
--- /dev/null
+++ b/Documentation/filesystems/lifecycles.rst
@@ -0,0 +1,357 @@
+======================
+Lifecycles and locking
+======================
+
+This manual aspires to cover the lifecycles of VFS objects and the locking
+that protects them.
+
+Reference counting for file structures
+======================================
+
+(The following text derives from `this email from Al Viro
+<https://lwn.net/ml/linux-fsdevel/20190207040058.GW2217@ZenIV.linux.org.uk/>`_).
+
+The :c:type:`struct file` type represents an open file in the kernel.  Its
+lifetime is controlled by a simple reference count (f_count) in that
+structure.  References are obtained with functions like fget(), fdget(),
+and fget_raw(); they are returned with fput().
+
+.. FIXME we should have kerneldoc comments for those functions
+
+The struct file destructor (__fput() and the filesystem-specific
+->release() function called from it) is called once the counter hits zero.
+Each file descriptor counts as a reference.  Thus, dup() will increment
+the refcount by 1, close() will decrement it, fork() will increment it
+by the number of descriptors in your descriptor table referring to this
+struct file, destruction of the descriptor table on exit() will decrement
+by the same amount, etc.
+
+Syscalls like read() and friends turn descriptors into struct file
+references.  If the descriptor table is shared, that counts as a new
+reference that must be dropped at the end of the syscall; otherwise we are
+guaranteed that the reference in the descriptor table will stay around
+until the end of the syscall, so we may use it without bumping the file
+refcount.  That's the difference between fget() and fdget() - the former
+will bump the refcount, while the latter will try to avoid that.  Of
+course, if we do not intend to drop the reference we'd acquired by the end
+of the syscall, we want fget(); fdget() is for transient references only.
+
+Descriptor tables
+-----------------
+
+Descriptor tables (:c:type:`struct files_struct`) *can* be shared; several
+processes (usually threads that share address spaces as well, but that's
+not necessary) may be working with the same set of struct files so, for
+example, an open() call in one of them is seen by the others.  The same
+goes for close(), dup(), dup2(), etc.
+
+That makes for an interesting corner case: what if two threads happen to
+share a descriptor table, and one of them closes a file descriptor while
+another is in the middle of a read() call on that same descriptor?  That's
+one area where Unices differ; one variant is to abort the read() call,
+another would have close() wait for the read() call to finish, etc.  What
+we do is:
+
+  * close() succeeds immediately; the reference is removed from
+    the descriptor table and dropped.
+
+  * If the close() call happens before read(fd, ...) has converted the file
+    descriptor to a struct file reference, read() will fail with -EBADF.
+
+  * Otherwise, read() proceeds unmolested.  The reference it has acquired
+    is dropped at the end of the syscall.  If that's the last reference to
+    the file, the file structure will get shut down at that point.
+
+A call to clone() will result in the child sharing the parent's descriptor
+table if CLONE_FILES is in the flags.  Note that, in this case, struct file
+refcounts are not modified at all, since no new references to files are
+created.  Without CLONE_FILES, it's the same as fork(): an independent copy
+of the descriptor table is created and populated by copies of references to
+files, each bumping the file's refcount.
+
+Calling unshare() with CLONE_FILES in the flags will create a copy of the
+descriptor table (same as done on fork(), etc.) and switch to using it; the
+old reference will be dropped (note: it'll only bother with that if the
+descriptor table used to be shared in the first place; if we hold the only
+reference to the descriptor table, we'll just keep using it).
+
+execve() does almost the same thing: if the pre-exec descriptor table is
+shared, it will switch to a new copy first.  In case of success the
+reference to the original table is dropped, in case of failure we revert to
+the original and drop the copy.  Note that handling of close-on-exec is
+done in the *copy*; the original is unaffected, so failing in execve() does
+not disrupt the descriptor table.
+
+exit() will drop the reference to the descriptor table.  When the last
+reference is dropped, all file references are removed from it (and dropped).
+
+The thread's pointer to its descriptor table (current->files) is never
+modified by other threads; something like::
+
+  ls /proc/<pid>/fd
+
+will fetch it, so stores need to be protected (by task_lock(current)), but
+only the thread itself can do them.
+
+Note that, while extra references to the descriptor table can appear at any
+time (/proc/<pid>/fd accesses, for example), such references may not be
+used for modifications.  In particular, you can't switch to another
+thread's descriptor table, unless it had been yours at some earlier point
+*and* you've kept a reference to it.
+
+That's about it for descriptor tables; that, by far, is the main source of
+persistently held struct file references.  Transient references are grabbed
+by syscalls when they resolve a descriptor to a struct file pointer, which
+ought to be done once per syscall *and* reasonably early in it.
+Unfortunately, that's not all; there are other persistent struct file
+references.
+
+Other persistent references
+---------------------------
+
+A key point so far is that references to file structures are not held
+(directly or indirectly) in other file structures.  If that were
+universally true, life would be simpler, since we would never have to worry
+about reference-count loops.  Unfortunately, there are some more
+complicated cases that the kernel has to worry about.
+
+Some things, such as the case of a LOOP_SET_FD ioctl() call grabbing a
+reference to a file structure and stashing it in the lo_backing_file field
+of a loop_device structure, are reasonably simple.  The struct file
+reference will be dropped later, either directly by a LOOP_CLR_FD operation
+(if nothing else holds the thing open at the time) or later in
+lo_release().
+
+Note that, in the latter case, things can get a bit more complicated.  A
+process closing /dev/loop might drop the last reference to it, triggering a
+call to bdput() that releases the last reference holding a block device
+open.  That will trigger a call to lo_release(), which will drop the
+reference on the underlying file structure, which is almost certainly the
+last one at that point.  This case is still not a problem; while we do have
+the underlying struct file pinned by something held by another struct file,
+the dependency graph is acyclic, so the plain refcounts we are using work
+fine.
+
+The same goes for the things like e.g. ecryptfs opening an underlying
+(encrypted) file on open() and dropping it when the last reference to
+ecryptfs file is dropped; the only difference here is that the underlying
+struct file never appears in anyone's descriptor tables.
+
+However, in a couple of cases we do have something trickier.
+
+File references and SCM_RIGHTS
+------------------------------
+
+The SCM_RIGHTS datagram option with Unix-domain sockets can be used to
+transfer a file descriptor, and its associated struct file reference, to
+the receiving process.  That brings about a couple of situations where
+things can go wrong.
+
+Case 1: an SCM_RIGHTS datagram can be sent to an AF_UNIX socket.  That
+converts the caller-supplied array of descriptors into an array of struct
+file references, which gets attached to the packet we queue.  When the
+datagram is received, the struct file references are moved into the
+descriptor table of the recipient or, in case of error, dropped.  Note that
+sending some descriptors in an SCM_RIGHTS datagram and closing them
+immediately is perfectly legitimate: as soon as sendmsg() returns you can
+go ahead and close the descriptors you've sent.  The references for the
+recipient are already acquired, so you don't need to wait for the packet to
+be received.
+
+That would still be simple, if not for the fact that there's nothing to
+stop you from passing AF_UNIX sockets themselves around in the same way.
+In fact, that has legitimate uses and, most of the time, doesn't cause any
+complications at all.  However, it is possible to end up in the following
+situation:
+
+  * struct file instances A and B are both AF_UNIX sockets.
+  * The only reference to A is in the SCM_RIGHTS packet that sits in the
+    receiving queue of B.
+  * The only reference to B is in the SCM_RIGHTS packet that sits in the
+    receiving queue of A.
+
+That, of course, is where pure refcounting of any kind will break.
+
+The SCM_RIGHTS datagram that contains the sole reference to A can't be
+received without the recipient getting hold of a reference to B.  That
+cannot happen until somebody manages to receive the SCM_RIGHTS datagram
+containing the sole reference to B.  But that cannot happen until that
+somebody manages to get hold of a reference to A, which cannot happen until
+the first SCM_RIGHTS datagram is received.
+
+Dropping the last reference to A would have discarded everything in its
+receiving queue, including the SCM_RIGHTS datagram that contains the
+reference to B; however, that can't happen either; the other SCM_RIGHTS
+datagram would have to be either received or discarded first, etc.
+
+Case 2: similar, with a bit of a twist.  An AF_UNIX socket used for
+descriptor passing is normally set up by socket(), followed by connect().
+As soon as connect() returns, one can start sending.  Note that connect()
+does *NOT* wait for the recipient to call accept(); it creates the object
+that will serve as the low-level part of the other end of connection
+(complete with received packet queue) and stashes that object into the
+queue of the *listener's* socket.  A subsequent accept() call fetches it
+from there and attaches it to a new socket, completing the setup; in the
+meanwhile, sending packets works fine.  Once accept() is done, it'll see
+the stuff you'd sent already in the queue of the new socket and everything
+works fine.
+
+If the listening socket gets closed without accept() having been called,
+its queue is flushed, discarding all pending connection attempts, complete
+with *their* queues.  Which is the same effect as accept() + close(), so
+again, normally everything just works.  However, consider the case when we
+have:
+
+  * struct file instances A and B being AF_UNIX sockets.
+  * A is a listener
+  * B is an established connection, with the other end yet to be accepted
+    on A 
+  * The only references to A and B are in an SCM_RIGHTS datagram sent over
+    to A by B.
+
+That SCM_RIGHTS datagram could have been received if somebody had managed
+to call accept() on A and recvmsg() on the socket created by that accept()
+call.  But that can't happen without that somebody getting hold of a
+reference to A in the first place, which can't happen without having
+received that SCM_RIGHTS datagram.  It can't be discarded either, since
+that can't happen without dropping the last reference to A, which sits
+right in it.
+
+The difference from the previous case is that there we had:
+
+  * A holds unix_sock of A
+  * unix_sock of A holds SCM_RIGHTS with reference to B
+  * B holds unix_sock of B
+  * unix_sock of B holds SCM_RIGHTS with reference to A
+
+and here we have:
+
+  * A holds unix_sock of A
+  * unix_sock of A holds the packet with reference to embryonic unix_sock
+    created by connect()
+  * that embryonic unix_sock holds SCM_RIGHTS with references to A and B.
+
+The dependency graph is different, but the problem is the same; there are
+unreachable loops in it.  Note that neither class of situations
+would occur normally; in the best case it's "somebody had been
+doing rather convoluted descriptor passing, but everyone involved
+got hit with kill -9 at the wrong time; please, make sure nothing
+leaks".  That can happen, but a userland race (e.g. botched protocol
+handling of some sort) or a deliberate abuse are much more likely.
+
+Catching the loop creation is hard and paying for that every time we do
+descriptor-passing would be a bad idea.  Besides, the loop per se is not
+fatal; if, for example, in the second case the descriptor for A had been
+kept around, close(accept()) would've cleaned everything up.  Which means
+that we need a garbage collector to deal with the (rare) leaks.
+
+Note that, in both cases, the leaks are caused by loops passing through
+some SCM_RIGHTS datagrams that can never be received.  So locating those,
+removing them from the queues they sit in, and then discarding the suckers
+is enough to resolve the situation.  Furthermore, in both cases the loop
+passes through the unix_sock of something that got sent over in an
+SCM_RIGHTS datagram.  So we can do the following:
+
+  1) Keep the count of references to file structures of AF_UNIX sockets
+     held by SCM_RIGHTS; this value is kept in unix_sock->inflight.  Any
+     struct unix_sock instance without such references is not part of an
+     unreachable loop.  Maintain the set of unix_sock that are not excluded
+     by that (i.e. the ones that have some references from SCM_RIGHTS
+     instances).  Note that we don't need to maintain those counts in
+     struct file; we care only about unix_sock here.
+
+  2) Any struct file of an AF_UNIX socket with some references *NOT* from
+     SCM_RIGHTS datagrams is also not part of an unreachable loop.
+
+  3) For each unix_sock, consider the following set of SCM_RIGHTS
+     datagrams: everything in the queue of that unix_sock if it's a
+     non-listener, and everything in queues of *all* embryonic unix_sock
+     structs in the queue of a listener.  Let's call those the SCM_RIGHTS
+     associated with our unix_sock.
+
+  4) All SCM_RIGHTS associated with a reachable unix_sock are themselves
+     reachable.
+
+  5) If some references to the struct file of a unix_sock are in reachable
+     SCM_RIGHTS, that struct file is reachable.
+
+The garbage collector starts with calculating the set of potentially
+unreachable unix_socks:  the ones not excluded by (1, 2).  No unix_sock
+instances outside of that set need to be considered.
+
+If some unix_sock in that set has a counter that is *not* entirely covered
+by SCM_RIGHTS associated with the elements of the set, we can conclude that
+there are references to it in SCM_RIGHTS associated with something outside
+of our set and therefore it is reachable and can be removed from the set.
+
+If that process converges to a non-empty set, we know that everything left
+in that set is unreachable - all references to their struct file come from
+some SCM_RIGHTS datagrams, and all those SCM_RIGHTS datagrams are among
+those that can't be received or discarded without getting hold of a
+reference to struct file of something in our set.
+
+Everything outside of that set is reachable, so taking the SCM_RIGHTS with
+references to stuff in our set (all of them to be found among those
+associated with elements of our set) out of the queues they are in will
+break all unreachable loops.  Discarding the collected datagrams will do
+the rest - the file references in those will be dropped, etc.
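
The candidate-set pruning described above can be sketched as a small fixed-point computation. This is a toy model only: struct toy_sock, toy_gc() and the refs_to[] matrix are hypothetical names made up for illustration, and the kernel's actual data structures and scanning order differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SOCKS 8

/* Toy stand-in for a unix_sock plus its struct file (not kernel types). */
struct toy_sock {
	int file_refs;	/* all references to the socket's struct file */
	int inflight;	/* how many of those live in SCM_RIGHTS datagrams */
	/* refs_to[j]: references to socket j held by the SCM_RIGHTS
	 * associated with this socket, in the sense of rule (3) */
	int refs_to[MAX_SOCKS];
};

/* Marks unreachable sockets in candidate[] and returns how many there are. */
static size_t toy_gc(const struct toy_sock *s, size_t n, bool *candidate)
{
	size_t i, j, left = 0;
	bool changed;

	assert(n <= MAX_SOCKS);

	/* Rules (1) and (2): in flight, and referenced from nowhere else. */
	for (i = 0; i < n; i++)
		candidate[i] = s[i].inflight > 0 &&
			       s[i].file_refs == s[i].inflight;

	/* Drop anything whose counter is not covered from inside the set. */
	do {
		changed = false;
		for (i = 0; i < n; i++) {
			int covered = 0;

			if (!candidate[i])
				continue;
			for (j = 0; j < n; j++)
				if (candidate[j])
					covered += s[j].refs_to[i];
			if (covered < s[i].inflight) {
				candidate[i] = false;	/* reachable after all */
				changed = true;
			}
		}
	} while (changed);

	for (i = 0; i < n; i++)
		left += candidate[i];
	return left;
}
```

In this model, two sockets that each hold an SCM_RIGHTS reference to the other and have no other references both stay in the set, while a single extra descriptor on either one makes both reachable.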
+
+One thing to keep in mind here is the locking.  What the garbage
+collector relies upon is:
+
+  * Changes to ->inflight are serialized with respect to it (on
+    unix_gc_lock; increments are done by unix_inflight(), decrements by
+    unix_notinflight()).
+
+  * Any references extracted from SCM_RIGHTS during the garbage collector
+    run will not be actually used until the end of garbage collection.  For
+    a normal recvmsg() call, this behavior is guaranteed by having
+    unix_notinflight() called between the extraction of scm_fp_list from
+    the packet and doing anything else with the references extracted.  For
+    a MSG_PEEK recvmsg() call, it's actually broken and lacks
+    synchronization; Miklos has proposed to grab and release unix_gc_lock
+    in those, between scm_fp_dup() and doing anything else with the
+    references copied.
+
+  * Adding SCM_RIGHTS in the middle of garbage collection is possible, but
+    in that case it will contain no references to anything in the initial
+    candidate set.
+
+.. FIXME: The MSG_PEEK item above should be updated when the fix happens.
+
+The last one is delicate.  SCM_RIGHTS creation has unix_inflight() called
+for each reference we put there, so it's serialized with respect to
+unix_gc(); however, insertion into the queue is *NOT* covered by that.
+Queue rescans are covered, but each queue has a lock of its own and they
+are definitely not going to be held throughout the whole thing.
+
+So in theory it would be possible to have:
+
+  * thread A: sendmsg() has SCM_RIGHTS created and populated, complete with
+    file refcount and ->inflight increments implied, at which point it gets
+    preempted and loses the timeslice.
+
+  * thread B: gets to run and removes all references from the descriptor
+    table it shares with thread A.
+
+  * on another CPU we have the garbage collector triggered; it determines
+    the set of potentially unreachable unix_sock and everything in our
+    SCM_RIGHTS _is_ in that set, now that no other references remain.
+
+  * on the first CPU, thread A regains the timeslice and inserts its
+    SCM_RIGHTS into the queue.  And it does contain references to sockets
+    from the candidate set of the running garbage collector, confusing the
+    hell out of it.
+
+That is avoided by a convoluted dance around the SCM_RIGHTS creation
+and insertion - we use fget() to obtain struct file references,
+then _duplicate_ them in SCM_RIGHTS (bumping a refcount for each, so
+we are holding *two* references), do unix_inflight() on them, then
+queue the damn thing, then drop each reference we got from fget().
+
+That way everything referred to in that SCM_RIGHTS is going to have
+extra struct file references (and thus be excluded from the initial
+candidate set) until after it gets inserted into the queue.  In other
+words, if it does appear in a queue between two passes, it's
+guaranteed to contain no references to anything in the initial
+candidate set.
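
The effect of holding two references across the insertion can be illustrated with a toy refcount model. The names below (struct toy_file, toy_fget(), toy_scm_attach(), toy_send()) are hypothetical stand-ins for the fget(), duplication plus unix_inflight(), and queueing steps described above; this is a sketch of the ordering, not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a struct file of an AF_UNIX socket (not a kernel type). */
struct toy_file {
	int f_count;	/* all references */
	int inflight;	/* references held by SCM_RIGHTS */
};

/* Would the collector put this file in its initial candidate set? */
static bool toy_gc_candidate(const struct toy_file *f)
{
	return f->inflight > 0 && f->f_count == f->inflight;
}

static void toy_fget(struct toy_file *f)
{
	f->f_count++;
}

static void toy_fput(struct toy_file *f)
{
	assert(f->f_count > 0);
	f->f_count--;
}

/* Duplicate the reference into SCM_RIGHTS and account it as in flight. */
static void toy_scm_attach(struct toy_file *f)
{
	f->f_count++;
	f->inflight++;
}

/* Returns true if the file was ever a candidate before being queued. */
static bool toy_send(struct toy_file *f, bool descriptor_closed_midway)
{
	bool exposed = false;

	toy_fget(f);			/* reference from the descriptor lookup */
	toy_scm_attach(f);		/* second reference, owned by SCM_RIGHTS */
	exposed |= toy_gc_candidate(f);

	if (descriptor_closed_midway)
		toy_fput(f);		/* another thread closes the descriptor */
	exposed |= toy_gc_candidate(f);	/* the window the text worries about */

	/* ... the datagram is inserted into the receive queue here ... */

	toy_fput(f);			/* only now drop the lookup reference */
	return exposed;
}
```

Starting from a file with one descriptor-table reference, toy_send() never reports the file as a candidate before the queueing point, even when the descriptor is closed in between; it can become a candidate only once the final toy_fput() has run.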
-- 
2.20.1


Thread overview:
2019-02-21 19:02 [PATCH 0/2] docs: Add some struct file refcounting information Jonathan Corbet
2019-02-21 19:02 ` [PATCH 1/2] docs: Bring some order to filesystem documentation Jonathan Corbet
2019-02-21 19:02 ` [PATCH 2/2] docs: Add struct file refcounting and SCM_RIGHTS mess info Jonathan Corbet
2019-02-21 19:06 ` [PATCH 0/2] docs: Add some struct file refcounting information Jonathan Corbet
2019-03-04 20:01 [RESEND PATCH " Jonathan Corbet
2019-03-04 20:01 ` [PATCH 2/2] docs: Add struct file refcounting and SCM_RIGHTS mess info Jonathan Corbet
