From: David Wysochanski <dwysocha@redhat.com>
To: Pavel Shilovsky <piastryyy@gmail.com>
Cc: Steve French <smfrench@gmail.com>,
	Ronnie Sahlberg <lsahlber@redhat.com>,
	linux-cifs <linux-cifs@vger.kernel.org>
Subject: Re: Easily reproducible deadlock on 5.4-rc4 with cifsInodeInfo.lock_sem semaphore after network disruption or smb restart
Date: Tue, 22 Oct 2019 15:00:34 -0400
Message-ID: <CALF+zOkrxNMAr==jFdfqOfojbg9thjY4bKLuO2s4m_gs3jEr=Q@mail.gmail.com>
In-Reply-To: <CALF+zOkqMLL05W6i_VZf17G+exz2PNObNwATVeU2GAZFFQyE6A@mail.gmail.com>

On Tue, Oct 22, 2019 at 12:43 PM David Wysochanski <dwysocha@redhat.com> wrote:
>
> On Tue, Oct 22, 2019 at 12:09 PM Pavel Shilovsky <piastryyy@gmail.com> wrote:
> >
> > On Mon, Oct 21, 2019 at 20:05, David Wysochanski <dwysocha@redhat.com> wrote:
> > > > ? If yes, then we can create another work queue that will be
> > > > re-locking files that have been re-opened. In this case
> > > > cifs_reopen_file will simply submit the "re-lock" request to the queue
> > > > without the need to grab the lock one more time.
> > > >
> > > I think your approach of a work queue may work, but it would have
> > > to be on a per-pid basis, as would any solution.
> >
> > Why do you think we should have a queue per pid? What are possible
> > problems to have a single queue per SMB connection or Share (Tree)
> > connection?
> >
>
> What you're proposing may work, but I would have to see the code and
> how lock_sem would be used.  If you're serializing the relocking to
> only one pid, it should be fine.
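
To make sure we're talking about the same thing, here is a rough sketch
of how I read the proposal (purely illustrative - the "cifsrelockd"
workqueue, the relock_work member in cifsFileInfo, and the INIT_WORK()
for it in cifs_new_fileinfo() are all assumptions, not existing code):

/* Hypothetical: a dedicated workqueue that re-locks re-opened files. */
static struct workqueue_struct *relock_wq;  /* alloc_workqueue("cifsrelockd", 0, 0) at module init */

static void cifs_relock_work(struct work_struct *work)
{
	struct cifsFileInfo *cfile = container_of(work, struct cifsFileInfo,
						  relock_work);

	/*
	 * Runs in workqueue context, which holds no lock_sem of its own,
	 * so the down_read() inside cifs_relock_file() cannot nest under
	 * a read hold taken by the task that hit the reconnect.
	 */
	cifs_relock_file(cfile);
	cifsFileInfo_put(cfile);	/* drop the reference taken when queued */
}

/* In cifs_reopen_file(), instead of calling cifs_relock_file() directly: */
	cifsFileInfo_get(cfile);	/* keep cfile alive until the work runs */
	queue_work(relock_wq, &cfile->relock_work);

That still leaves the question of how the deferred re-locks are ordered
against new lock requests, which I assume is what you would use the
queue's serialization for.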
>
> I can describe what I tried (attached patch) and why it did not work.
> Here's the problem code path.  Note that down_read(lock_sem) #2 is
> only a problem if a down_write(lock_sem) gets queued between #1 and
> #2: the second down_read() then blocks behind the waiting writer,
> which in turn is waiting for the first down_read() to be released, so
> the task deadlocks against itself.  That is why we're trying to avoid
> #2.
>
> PID: 5464   TASK: ffff9862640b2f80  CPU: 1   COMMAND: "python3-openloo"
>  #0 [ffffbd2d81203a20] __schedule at ffffffff9aa629e3
>  #1 [ffffbd2d81203ac0] schedule at ffffffff9aa62fe8
>  #2 [ffffbd2d81203ac8] rwsem_down_read_failed at ffffffff9aa660cd
>  #3 [ffffbd2d81203b58] cifs_reopen_file at ffffffffc0529dce [cifs]   <--- down_read(lock_sem) #2
>  #4 [ffffbd2d81203c00] cifs_read.constprop.46 at ffffffffc052bbe0 [cifs]
>  #5 [ffffbd2d81203ca8] cifs_readpage_worker at ffffffffc052c0a7 [cifs]
>  #6 [ffffbd2d81203ce0] cifs_readpage at ffffffffc052c427 [cifs]
>  #7 [ffffbd2d81203d20] generic_file_buffered_read at ffffffff9a40f2b1
>  #8 [ffffbd2d81203e10] cifs_strict_readv at ffffffffc0533a7b [cifs]   <--- down_read(lock_sem) #1
>  #9 [ffffbd2d81203e40] new_sync_read at ffffffff9a4b6e21
> #10 [ffffbd2d81203ec8] vfs_read at ffffffff9a4b9671
> #11 [ffffbd2d81203f00] ksys_read at ffffffff9a4b9aaf
> #12 [ffffbd2d81203f38] do_syscall_64 at ffffffff9a2041cb
>
> Note that each of these down_read() calls is per-pid (each pid has to
> get the sem).  I tried this patch:
> --- a/fs/cifs/cifsglob.h
> +++ b/fs/cifs/cifsglob.h
> @@ -1404,6 +1404,7 @@ struct cifsInodeInfo {
>  #define CIFS_INO_DELETE_PENDING                  (3) /* delete pending on server */
>  #define CIFS_INO_INVALID_MAPPING         (4) /* pagecache is invalid */
>  #define CIFS_INO_LOCK                    (5) /* lock bit for synchronization */
> +#define CIFS_LOCK_SEM_HELD               (6) /* the lock_sem is held */
>         unsigned long flags;
>         spinlock_t writers_lock;
>         unsigned int writers;           /* Number of writers on this inode */
> diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> index 5ad15de2bb4f..153adcd3ad16 100644
> --- a/fs/cifs/file.c
> +++ b/fs/cifs/file.c
> @@ -620,11 +620,15 @@ cifs_relock_file(struct cifsFileInfo *cfile)
>         struct cifsInodeInfo *cinode = CIFS_I(d_inode(cfile->dentry));
>         struct cifs_tcon *tcon = tlink_tcon(cfile->tlink);
>         int rc = 0;
> +       int lock_sem_held = 0;
>
> -       down_read_nested(&cinode->lock_sem, SINGLE_DEPTH_NESTING);
> +       lock_sem_held = test_bit(CIFS_LOCK_SEM_HELD, &cinode->flags);
> +       if (!lock_sem_held)
> +               down_read_nested(&cinode->lock_sem, SINGLE_DEPTH_NESTING);
>         if (cinode->can_cache_brlcks) {
>                 /* can cache locks - no need to relock */
> -               up_read(&cinode->lock_sem);
> +               if (!lock_sem_held)
> +                       up_read(&cinode->lock_sem);
>                 return rc;
>         }
>
> @@ -635,7 +639,8 @@ cifs_relock_file(struct cifsFileInfo *cfile)
>         else
>                 rc = tcon->ses->server->ops->push_mand_locks(cfile);
>
> -       up_read(&cinode->lock_sem);
> +       if (!lock_sem_held)
> +               up_read(&cinode->lock_sem);
>         return rc;
>  }
>
> @@ -3888,11 +3893,13 @@ cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to)
>          * with a brlock that prevents reading.
>          */
>         down_read(&cinode->lock_sem);
> +       set_bit(CIFS_LOCK_SEM_HELD, &cinode->flags);
>         if (!cifs_find_lock_conflict(cfile, iocb->ki_pos, iov_iter_count(to),
>                                      tcon->ses->server->vals->shared_lock_type,
>                                      0, NULL, CIFS_READ_OP))
>                 rc = generic_file_read_iter(iocb, to);
>         up_read(&cinode->lock_sem);
> +       clear_bit(CIFS_LOCK_SEM_HELD, &cinode->flags);
>         return rc;
>  }
>
>
> Note that the bit lives in the cifsInodeInfo, which is shared by every
> process that has the file open; we only have one file but multiple
> pids, so it's not an effective solution.  To elaborate, the patch
> doesn't work because when the server goes down you have many processes
> that have all passed down_read() #1 and are blocked waiting for a
> response.  When the server comes back, at least one of them will see
> the bit set and skip down_read() #2, but it will then go on to finish
> cifs_strict_readv() and clear the bit, so the other pids may no longer
> see the bit set and will call down_read() #2, leading to the same
> deadlock.
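
If the marker has to follow the opener rather than the inode, one purely
hypothetical variant (sketch only, untested; cifsFileInfo has no
lock_sem_held member today) would be to hang the flag off the
cifsFileInfo that both cifs_strict_readv() and cifs_reopen_file()
already pass around, so one pid finishing its read cannot clear another
pid's state:

/* cifs_strict_readv(): mark this opener as holding lock_sem for read */
	down_read(&cinode->lock_sem);
	cfile->lock_sem_held = true;		/* hypothetical new member */
	...
	cfile->lock_sem_held = false;
	up_read(&cinode->lock_sem);

/* cifs_relock_file(): only take lock_sem if this opener doesn't hold it */
	bool already_held = cfile->lock_sem_held;

	if (!already_held)
		down_read_nested(&cinode->lock_sem, SINGLE_DEPTH_NESTING);
	...
	if (!already_held)
		up_read(&cinode->lock_sem);

This still hand-waves a cifsFileInfo shared by several threads of the
same process, so it is only meant to illustrate the "per-pid" point,
not to be a fix.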


One other thing to think about here is this.  The other part of the
deadlock, as seen in the reproducer, is due to this process:

PID: 5411   TASK: ffff9862788fc740  CPU: 3   COMMAND: "python3-openloo"
 #0 [ffffbd2d80fbfc60] __schedule at ffffffff9aa629e3
 #1 [ffffbd2d80fbfd00] schedule at ffffffff9aa62fe8
 #2 [ffffbd2d80fbfd08] rwsem_down_write_failed at ffffffff9aa65dbc
 #3 [ffffbd2d80fbfda8] down_write at ffffffff9aa65be9
 #4 [ffffbd2d80fbfdb8] _cifsFileInfo_put at ffffffffc052da65 [cifs]
 #5 [ffffbd2d80fbfe98] cifs_close at ffffffffc052dd3f [cifs]
 #6 [ffffbd2d80fbfea8] __fput at ffffffff9a4ba5c7
 #7 [ffffbd2d80fbfee8] task_work_run at ffffffff9a2d1c5a
 #8 [ffffbd2d80fbff20] exit_to_usermode_loop at ffffffff9a203b4b
 #9 [ffffbd2d80fbff38] do_syscall_64 at ffffffff9a2042f2
#10 [ffffbd2d80fbff50] entry_SYSCALL_64_after_hwframe at ffffffff9ac000ad


What's that process doing?  Well, it is blocked in down_write(), about
to delete and free all entries on cifs_file->llist->locks:
void _cifsFileInfo_put(struct cifsFileInfo *cifs_file, bool wait_oplock_handler)

    /*
     * Delete any outstanding lock records. We'll lose them when the file
     * is closed anyway.
     */
    down_write(&cifsi->lock_sem);
    list_for_each_entry_safe(li, tmp, &cifs_file->llist->locks, llist) {
        list_del(&li->llist);
        cifs_del_lock_waiters(li);
        kfree(li);
    }


The other trace where we see someone trying to down_write() is here:

PID: 14207  TASK: ffff8ac1f3369e40  CPU: 6   COMMAND: "python3-openloo"
 #0 [ffff9cfc41ccb990] __schedule at ffffffff9f9d2d29
 #1 [ffff9cfc41ccba20] schedule at ffffffff9f9d31a9
 #2 [ffff9cfc41ccba38] rwsem_down_write_slowpath at ffffffff9f131b0a
 #3 [ffff9cfc41ccbb08] cifs_new_fileinfo at ffffffffc065833a [cifs]
 #4 [ffff9cfc41ccbb78] cifs_open at ffffffffc0658a8c [cifs]
 #5 [ffff9cfc41ccbcd0] do_dentry_open at ffffffff9f2f09ca
 #6 [ffff9cfc41ccbd00] path_openat at ffffffff9f3050ad
 #7 [ffff9cfc41ccbdd8] do_filp_open at ffffffff9f307431
 #8 [ffff9cfc41ccbee0] do_sys_open at ffffffff9f2f2154
 #9 [ffff9cfc41ccbf38] do_syscall_64 at ffffffff9f0042bb
#10 [ffff9cfc41ccbf50] entry_SYSCALL_64_after_hwframe at ffffffff9fa0008c

What is that process doing?  It is creating a new cifs_fid_locks entry
with an empty lock list and adding it onto cinode->llist:
cifs_new_fileinfo(struct cifs_fid *fid, struct file *file,
          struct tcon_link *tlink, __u32 oplock)
...
    cfile = kzalloc(sizeof(struct cifsFileInfo), GFP_KERNEL);
    if (cfile == NULL)
        return cfile;

    fdlocks = kzalloc(sizeof(struct cifs_fid_locks), GFP_KERNEL);
    if (!fdlocks) {
        kfree(cfile);
        return NULL;
    }

    INIT_LIST_HEAD(&fdlocks->locks);
    fdlocks->cfile = cfile;
    cfile->llist = fdlocks;
    down_write(&cinode->lock_sem);
    list_add(&fdlocks->llist, &cinode->llist);
    up_write(&cinode->lock_sem);



So back to the reader: why did it call down_read(), hold the semaphore,
and go to sleep waiting for the read response (or a reconnect)?
cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to)
...
    /*
     * We need to hold the sem to be sure nobody modifies lock list
     * with a brlock that prevents reading.
     */
    down_read(&cinode->lock_sem);
    if (!cifs_find_lock_conflict(cfile, iocb->ki_pos, iov_iter_count(to),
                     tcon->ses->server->vals->shared_lock_type,
                     0, NULL, CIFS_READ_OP))
        rc = generic_file_read_iter(iocb, to);
    up_read(&cinode->lock_sem);


Neither of the above two callers of down_write() is doing the operation
that down_read() #1 is meant to guard against - "modify the lock list
with a brlock that prevents reading".
So if there is some way to pull that piece out, the deadlock can be
solved.  The existing use of lock_sem both to a) protect reads against
a lock that prevents reading, and b) protect modifications to
cinode->llist, causes unnecessary contention on the semaphore and leads
to this deadlock.
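
To make "pulling that piece out" concrete, here is a minimal sketch of
the direction I mean (not a tested patch; the llist_lock field does not
exist, and every other path that walks cinode->llist or a per-fd locks
list would also have to take it for this to be safe):

/* Hypothetical: a separate lock for list membership, leaving lock_sem
 * to do only job (a) above.  Assumes "spinlock_t llist_lock;" is added
 * to cifsInodeInfo and initialized next to lock_sem. */

/* cifs_new_fileinfo(): adding an empty per-fd lock list */
	spin_lock(&cinode->llist_lock);
	list_add(&fdlocks->llist, &cinode->llist);
	spin_unlock(&cinode->llist_lock);

/* _cifsFileInfo_put(): tearing down this fd's lock records */
	spin_lock(&cifsi->llist_lock);
	list_for_each_entry_safe(li, tmp, &cifs_file->llist->locks, llist) {
		list_del(&li->llist);
		cifs_del_lock_waiters(li);
		kfree(li);
	}
	list_del(&cifs_file->llist->llist);
	spin_unlock(&cifsi->llist_lock);

With something like that, neither the open path nor the close path would
ever queue a writer on lock_sem, so the nested down_read() in
cifs_reopen_file() could no longer deadlock behind a writer queued by
open or close.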
