From: David Wysochanski <dwysocha@redhat.com>
To: Pavel Shilovsky <piastryyy@gmail.com>
Cc: Steve French <smfrench@gmail.com>,
	Ronnie Sahlberg <lsahlber@redhat.com>,
	linux-cifs <linux-cifs@vger.kernel.org>
Subject: Re: Easily reproducible deadlock on 5.4-rc4 with cifsInodeInfo.lock_sem semaphore after network disruption or smb restart
Date: Tue, 22 Oct 2019 15:00:34 -0400	[thread overview]
Message-ID: <CALF+zOkrxNMAr==jFdfqOfojbg9thjY4bKLuO2s4m_gs3jEr=Q@mail.gmail.com> (raw)
In-Reply-To: <CALF+zOkqMLL05W6i_VZf17G+exz2PNObNwATVeU2GAZFFQyE6A@mail.gmail.com>

On Tue, Oct 22, 2019 at 12:43 PM David Wysochanski <dwysocha@redhat.com> wrote:
>
> On Tue, Oct 22, 2019 at 12:09 PM Pavel Shilovsky <piastryyy@gmail.com> wrote:
> >
> > On Mon, Oct 21, 2019 at 8:05 PM David Wysochanski <dwysocha@redhat.com> wrote:
> > > > ? If yes, then we can create another work queue that will be
> > > > re-locking files that have been re-opened. In this case
> > > > cifs_reopen_file will simply submit the "re-lock" request to the queue
> > > > without the need to grab the lock one more time.
> > > >
> > > I think your approach of a work queue may work, but it would have to
> > > be on a per-pid basis, as would any solution.
> >
> > Why do you think we should have a queue per pid? What are possible
> > problems to have a single queue per SMB connection or Share (Tree)
> > connection?
> >
>
> What you're proposing may work, but I would have to see the code and how
> lock_sem would be used.  If you're serializing the relocking to only
> one pid it should be fine.
>
> I can describe what I tried (attached patch) and why it did not work.
> Here's the problem code path.  Note that down_read(lock_sem) #2 is
> only a problem if down_write(lock_sem) happens between #1 and #2, but
> we're trying to avoid #2 to avoid the deadlock.
>
> PID: 5464   TASK: ffff9862640b2f80  CPU: 1   COMMAND: "python3-openloo"
>  #0 [ffffbd2d81203a20] __schedule at ffffffff9aa629e3
>  #1 [ffffbd2d81203ac0] schedule at ffffffff9aa62fe8
>  #2 [ffffbd2d81203ac8] rwsem_down_read_failed at ffffffff9aa660cd
>  #3 [ffffbd2d81203b58] cifs_reopen_file at ffffffffc0529dce [cifs]
> <--- down_read (lock_sem) #2
>  #4 [ffffbd2d81203c00] cifs_read.constprop.46 at ffffffffc052bbe0 [cifs]
>  #5 [ffffbd2d81203ca8] cifs_readpage_worker at ffffffffc052c0a7 [cifs]
>  #6 [ffffbd2d81203ce0] cifs_readpage at ffffffffc052c427 [cifs]
>  #7 [ffffbd2d81203d20] generic_file_buffered_read at ffffffff9a40f2b1
>  #8 [ffffbd2d81203e10] cifs_strict_readv at ffffffffc0533a7b [cifs]
> <---- down_read (lock_sem) #1
>  #9 [ffffbd2d81203e40] new_sync_read at ffffffff9a4b6e21
> #10 [ffffbd2d81203ec8] vfs_read at ffffffff9a4b9671
> #11 [ffffbd2d81203f00] ksys_read at ffffffff9a4b9aaf
> #12 [ffffbd2d81203f38] do_syscall_64 at ffffffff9a2041cb
>
> Note that each of these down_read() calls is per-pid (each pid has to
> acquire the sem).  I tried this patch:
> --- a/fs/cifs/cifsglob.h
> +++ b/fs/cifs/cifsglob.h
> @@ -1404,6 +1404,7 @@ struct cifsInodeInfo {
>  #define CIFS_INO_DELETE_PENDING                  (3) /* delete pending on server */
>  #define CIFS_INO_INVALID_MAPPING         (4) /* pagecache is invalid */
>  #define CIFS_INO_LOCK                    (5) /* lock bit for synchronization */
> +#define CIFS_LOCK_SEM_HELD               (6) /* the lock_sem is held */
>         unsigned long flags;
>         spinlock_t writers_lock;
>         unsigned int writers;           /* Number of writers on this inode */
> diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> index 5ad15de2bb4f..153adcd3ad16 100644
> --- a/fs/cifs/file.c
> +++ b/fs/cifs/file.c
> @@ -620,11 +620,15 @@ cifs_relock_file(struct cifsFileInfo *cfile)
>         struct cifsInodeInfo *cinode = CIFS_I(d_inode(cfile->dentry));
>         struct cifs_tcon *tcon = tlink_tcon(cfile->tlink);
>         int rc = 0;
> +       int lock_sem_held = 0;
>
> -       down_read_nested(&cinode->lock_sem, SINGLE_DEPTH_NESTING);
> +       lock_sem_held = test_bit(CIFS_LOCK_SEM_HELD, &cinode->flags);
> +       if (!lock_sem_held)
> +               down_read_nested(&cinode->lock_sem, SINGLE_DEPTH_NESTING);
>         if (cinode->can_cache_brlcks) {
>                 /* can cache locks - no need to relock */
> -               up_read(&cinode->lock_sem);
> +               if (!lock_sem_held)
> +                       up_read(&cinode->lock_sem);
>                 return rc;
>         }
>
> @@ -635,7 +639,8 @@ cifs_relock_file(struct cifsFileInfo *cfile)
>         else
>                 rc = tcon->ses->server->ops->push_mand_locks(cfile);
>
> -       up_read(&cinode->lock_sem);
> +       if (!lock_sem_held)
> +               up_read(&cinode->lock_sem);
>         return rc;
>  }
>
> @@ -3888,11 +3893,13 @@ cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to)
>          * with a brlock that prevents reading.
>          */
>         down_read(&cinode->lock_sem);
> +       set_bit(CIFS_LOCK_SEM_HELD, &cinode->flags);
>         if (!cifs_find_lock_conflict(cfile, iocb->ki_pos, iov_iter_count(to),
>                                      tcon->ses->server->vals->shared_lock_type,
>                                      0, NULL, CIFS_READ_OP))
>                 rc = generic_file_read_iter(iocb, to);
>         up_read(&cinode->lock_sem);
> +       clear_bit(CIFS_LOCK_SEM_HELD, &cinode->flags);
>         return rc;
>  }
>
>
> Note that the bit is associated with the cifsInodeInfo (the inode), and
> we have only one file but multiple pids, so it's not an effective
> solution.  To elaborate, the patch doesn't work because when the server
> goes down, many processes have already passed down_read() #1 and are
> blocked waiting for a response.  When the server comes back, at least
> one of them will see the bit set and skip down_read() #2, but it will
> then finish cifs_strict_readv() and clear the bit, so other pids may no
> longer see the bit set and will call down_read() #2, hitting the same
> deadlock.
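
Going back to your work-queue idea above, here is a very rough, untested
sketch of the sort of thing I could picture (cifs_relock_work,
cifs_relock_worker and cifs_queue_relock are all names I just made up for
illustration, this is not a real patch):

struct cifs_relock_work {
	struct work_struct work;
	struct cifsFileInfo *cfile;
};

static void cifs_relock_worker(struct work_struct *work)
{
	struct cifs_relock_work *rw =
		container_of(work, struct cifs_relock_work, work);

	/* only this worker context takes lock_sem for re-locking */
	cifs_relock_file(rw->cfile);
	cifsFileInfo_put(rw->cfile);
	kfree(rw);
}

/* called from cifs_reopen_file() instead of taking lock_sem there */
static int cifs_queue_relock(struct cifsFileInfo *cfile)
{
	struct cifs_relock_work *rw;

	rw = kzalloc(sizeof(*rw), GFP_KERNEL);
	if (!rw)
		return -ENOMEM;

	cifsFileInfo_get(cfile);	/* reference dropped by the worker */
	rw->cfile = cfile;
	INIT_WORK(&rw->work, cifs_relock_worker);
	queue_work(cifsiod_wq, &rw->work);
	return 0;
}

That would funnel all re-locking into one workqueue context so that
cifs_reopen_file() never has to touch lock_sem itself; the open question
is whether it is safe for the re-locking to happen some time after the
reopen completes.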


One other thing to think about here is this.  The other part of the
deadlock, as seen in the reproducer, is due to this process:

PID: 5411   TASK: ffff9862788fc740  CPU: 3   COMMAND: "python3-openloo"
 #0 [ffffbd2d80fbfc60] __schedule at ffffffff9aa629e3
 #1 [ffffbd2d80fbfd00] schedule at ffffffff9aa62fe8
 #2 [ffffbd2d80fbfd08] rwsem_down_write_failed at ffffffff9aa65dbc
 #3 [ffffbd2d80fbfda8] down_write at ffffffff9aa65be9
 #4 [ffffbd2d80fbfdb8] _cifsFileInfo_put at ffffffffc052da65 [cifs]
 #5 [ffffbd2d80fbfe98] cifs_close at ffffffffc052dd3f [cifs]
 #6 [ffffbd2d80fbfea8] __fput at ffffffff9a4ba5c7
 #7 [ffffbd2d80fbfee8] task_work_run at ffffffff9a2d1c5a
 #8 [ffffbd2d80fbff20] exit_to_usermode_loop at ffffffff9a203b4b
 #9 [ffffbd2d80fbff38] do_syscall_64 at ffffffff9a2042f2
#10 [ffffbd2d80fbff50] entry_SYSCALL_64_after_hwframe at ffffffff9ac000ad


What's that process doing?  It is blocked in down_write(), about to
delete and free all the entries on cifs_file->llist->locks:
void _cifsFileInfo_put(struct cifsFileInfo *cifs_file, bool wait_oplock_handler)

    /*
     * Delete any outstanding lock records. We'll lose them when the file
     * is closed anyway.
     */
    down_write(&cifsi->lock_sem);
    list_for_each_entry_safe(li, tmp, &cifs_file->llist->locks, llist) {
        list_del(&li->llist);
        cifs_del_lock_waiters(li);
        kfree(li);
    }
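
Putting the two traces together, the ordering that produces the deadlock
looks like this (assuming the usual rwsem fairness where a queued writer
blocks any readers that arrive after it):

    pid 5464: down_read(&cinode->lock_sem)   /* #1, cifs_strict_readv */
              ...read hangs when the server goes away...
    pid 5411: down_write(&cinode->lock_sem)  /* _cifsFileInfo_put, queued
                                                behind reader 5464 */
    pid 5464: down_read(&cinode->lock_sem)   /* #2, cifs_reopen_file on
                                                reconnect; blocks behind
                                                the queued writer */

Nobody can make progress from there.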


The other trace where we see someone trying to take down_write() is here:

PID: 14207  TASK: ffff8ac1f3369e40  CPU: 6   COMMAND: "python3-openloo"
 #0 [ffff9cfc41ccb990] __schedule at ffffffff9f9d2d29
 #1 [ffff9cfc41ccba20] schedule at ffffffff9f9d31a9
 #2 [ffff9cfc41ccba38] rwsem_down_write_slowpath at ffffffff9f131b0a
 #3 [ffff9cfc41ccbb08] cifs_new_fileinfo at ffffffffc065833a [cifs]
 #4 [ffff9cfc41ccbb78] cifs_open at ffffffffc0658a8c [cifs]
 #5 [ffff9cfc41ccbcd0] do_dentry_open at ffffffff9f2f09ca
 #6 [ffff9cfc41ccbd00] path_openat at ffffffff9f3050ad
 #7 [ffff9cfc41ccbdd8] do_filp_open at ffffffff9f307431
 #8 [ffff9cfc41ccbee0] do_sys_open at ffffffff9f2f2154
 #9 [ffff9cfc41ccbf38] do_syscall_64 at ffffffff9f0042bb
#10 [ffff9cfc41ccbf50] entry_SYSCALL_64_after_hwframe at ffffffff9fa0008c

What is that process doing?  It is creating a new empty list of
cifs_fid_locks and adding that onto cinode->llist:
cifs_new_fileinfo(struct cifs_fid *fid, struct file *file,
          struct tcon_link *tlink, __u32 oplock)
...
    cfile = kzalloc(sizeof(struct cifsFileInfo), GFP_KERNEL);
    if (cfile == NULL)
        return cfile;

    fdlocks = kzalloc(sizeof(struct cifs_fid_locks), GFP_KERNEL);
    if (!fdlocks) {
        kfree(cfile);
        return NULL;
    }

    INIT_LIST_HEAD(&fdlocks->locks);
    fdlocks->cfile = cfile;
    cfile->llist = fdlocks;
    down_write(&cinode->lock_sem);
    list_add(&fdlocks->llist, &cinode->llist);
    up_write(&cinode->lock_sem);



So back to the reader: why did it take down_read() and hold the semaphore
while going to sleep waiting for the read response (or a reconnect)?
cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to)
...
    /*
     * We need to hold the sem to be sure nobody modifies lock list
     * with a brlock that prevents reading.
     */
    down_read(&cinode->lock_sem);
    if (!cifs_find_lock_conflict(cfile, iocb->ki_pos, iov_iter_count(to),
                     tcon->ses->server->vals->shared_lock_type,
                     0, NULL, CIFS_READ_OP))
        rc = generic_file_read_iter(iocb, to);
    up_read(&cinode->lock_sem);


Neither of the above two callers of down_write() is doing the operation
that down_read() #1 is meant to guard against - "modify the lock list
with a brlock that prevents reading".
So if there is some way to pull that piece out, the deadlock can be
solved.  Using lock_sem both to a) protect reads against a lock that
prevents reading and b) protect modifications to cinode->llist causes
unnecessary contention on the semaphore, and this deadlock.
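
To make "pull that piece out" a bit more concrete, here is the sort of
thing I am imagining (purely a sketch, nothing tested - llist_lock is an
invented field name, and whether this is actually safe against
cifs_find_lock_conflict() walking the same lists under down_read() is
exactly what would have to be worked out):

struct cifsInodeInfo {
	...
	struct rw_semaphore lock_sem;	/* still: brlocks vs. reads */
	spinlock_t llist_lock;		/* NEW (hypothetical): protects llist */
	struct list_head llist;		/* locks held by this inode */
	...
};

cifs_new_fileinfo() and _cifsFileInfo_put() would then take llist_lock
instead of the write side of lock_sem, e.g. in cifs_new_fileinfo():

	spin_lock(&cinode->llist_lock);
	list_add(&fdlocks->llist, &cinode->llist);
	spin_unlock(&cinode->llist_lock);

That would leave lock_sem doing only what the comment in
cifs_strict_readv() says it is for, and neither the open path nor the
close path would have any reason to block behind a reader.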
