linux-kernel.vger.kernel.org archive mirror
From: Kent Overstreet <kent.overstreet@linux.dev>
To: Jeff Layton <jlayton@kernel.org>
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-bcachefs@vger.kernel.org,
	Kent Overstreet <kent.overstreet@gmail.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>, Will Deacon <will@kernel.org>,
	Waiman Long <longman@redhat.com>,
	Boqun Feng <boqun.feng@gmail.com>
Subject: Re: [PATCH 04/32] locking: SIX locks (shared/intent/exclusive)
Date: Sun, 14 May 2023 22:39:56 -0400
Message-ID: <ZGGbfFFkZZruo8J/@moria.home.lan>
In-Reply-To: <e87c74c05e79a01b160ce0ae81a9ef4229670930.camel@kernel.org>

On Sun, May 14, 2023 at 08:15:20AM -0400, Jeff Layton wrote:
> So the idea is to create a fundamentally unfair rwsem? One that always
> prefers readers over writers?

No, not sure where you're getting that from. It's unfair, but writers are
preferred over readers :)

> 
> > + * Other operations:
> > + *
> > + *   six_trylock_read()
> > + *   six_trylock_intent()
> > + *   six_trylock_write()
> > + *
> > + *   six_lock_downgrade():	convert from intent to read
> > + *   six_lock_tryupgrade():	attempt to convert from read to intent
> > + *
> > + * Locks also embed a sequence number, which is incremented when the lock is
> > + * locked or unlocked for write. The current sequence number can be grabbed
> > + * while a lock is held from lock->state.seq; then, if you drop the lock you can
> > + * use six_relock_(read|intent|write)(lock, seq) to attempt to retake the lock
> > + * iff it hasn't been locked for write in the meantime.
> > + *
> 
> ^^^
> This is a cool idea.

It's used heavily in bcachefs so we can drop locks if we might be
blocking - and then relock and continue, at the cost of a transaction
restart if the relock fails. It's a huge win for tail latency.
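
To make the drop-and-relock pattern concrete, here is a minimal
single-threaded sketch in plain C. It is a toy model, not the six locks
code: struct toy_lock and the toy_* helpers are hypothetical names, and
the only behaviour taken from the patch comment is that the sequence
number is bumped on every write lock and write unlock, and that a relock
succeeds only if the sequence number is unchanged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model for illustration only; not the kernel's struct six_lock. */
struct toy_lock {
	uint32_t seq;		/* bumped on every write lock and write unlock */
	unsigned readers;	/* number of read holds */
	bool write_held;
};

static bool toy_trylock_read(struct toy_lock *l)
{
	if (l->write_held)
		return false;
	l->readers++;
	return true;
}

static void toy_unlock_read(struct toy_lock *l)
{
	l->readers--;
}

static bool toy_trylock_write(struct toy_lock *l)
{
	if (l->write_held || l->readers)
		return false;
	l->write_held = true;
	l->seq++;		/* seq is odd while write locked */
	return true;
}

static void toy_unlock_write(struct toy_lock *l)
{
	l->write_held = false;
	l->seq++;		/* seq is even again */
}

/*
 * Retake the lock for read only if nobody has write locked it since
 * the caller sampled seq; on failure the caller has to restart.
 */
static bool toy_relock_read(struct toy_lock *l, uint32_t seq)
{
	if (l->write_held || l->seq != seq)
		return false;
	l->readers++;
	return true;
}
```

A caller would save l.seq while holding the lock, drop the lock before
doing something that might block, and then call toy_relock_read(); a
false return corresponds to the transaction restart mentioned above.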

> > + * type is one of SIX_LOCK_read, SIX_LOCK_intent, or SIX_LOCK_write:
> > + *
> > + *   six_lock_type(lock, type)
> > + *   six_unlock_type(lock, type)
> > + *   six_relock(lock, type, seq)
> > + *   six_trylock_type(lock, type)
> > + *   six_trylock_convert(lock, from, to)
> > + *
> > + * A lock may be held multiple times by the same thread (for read or intent,
> > + * not write). However, the six locks code does _not_ implement the actual
> > + * recursive checks itself - rather, if your code (e.g. btree iterator
> > + * code) knows that the current thread already has a lock held, and for the
> > + * correct type, six_lock_increment() may be used to bump up the counter for
> > + * that type - the only effect is that one more call to unlock will be required
> > + * before the lock is unlocked.
> 
> These semantics are a bit confusing. Once you hold a read or intent lock,
> you can take it as many times as you like. What happens if I take it in
> one context and release it in another? Say, across a workqueue job for
> instance?

Not allowed because of lockdep, same as with other locks.

> Are intent locks "converted" to write locks, or do they stack? For
> instance, suppose I take the intent lock 3 times and then take a write
> lock. How many times do I have to call unlock to fully release it (3 or
> 4)? If I release it just once, do I still hold the write lock or am I
> back to "intent" state?

They stack. You'd call unlock_write() once and unlock_intent() three
times.
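
The unlock accounting can be sketched with a small counter model in C.
This is only an illustration of "they stack": struct toy_holds and the
toy_* helpers are hypothetical, and the model assumes that write is taken
on top of existing intent holds and can be held at most once.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy per-thread hold counts; a hypothetical model, not struct six_lock. */
struct toy_holds {
	unsigned intent;	/* intent holds, bumped like six_lock_increment() */
	bool write;		/* write may be held at most once */
};

static void toy_lock_intent(struct toy_holds *h)
{
	h->intent++;
}

/* Write stacks on top of intent; it does not consume the intent holds. */
static bool toy_lock_write(struct toy_holds *h)
{
	if (!h->intent || h->write)
		return false;
	h->write = true;
	return true;
}

static void toy_unlock_write(struct toy_holds *h)
{
	h->write = false;	/* back to intent, all intent holds remain */
}

static void toy_unlock_intent(struct toy_holds *h)
{
	h->intent--;
}

static bool toy_fully_released(const struct toy_holds *h)
{
	return !h->intent && !h->write;
}
```

With three toy_lock_intent() calls followed by toy_lock_write(), the lock
is only fully released after one toy_unlock_write() and three
toy_unlock_intent() calls, four unlocks in total.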

> Some basic info about the underlying design would be nice here. What
> info is tracked in the union below? When are different members being
> used? How does the code decide which way to cast this thing? etc.

The field names seem pretty descriptive to me.

The counter and v members are just there for the READ_ONCE()/atomic64
cmpxchg ops.

> Ewww...bitfields. That seems a bit scary in a union. There is no
> guarantee that the underlying arch will even pack that into a single
> word, AIUI. It may be safer to do this with masking and shifting
> instead.

It wouldn't hurt to add a BUILD_BUG_ON() for the size, but I don't find
anything "scary" about unions and bitfields :)

And it makes the code more descriptive and readable than masking and
shifting.
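
A size check along those lines could look like the following user-space
sketch, which uses C11 _Static_assert in place of the kernel's
BUILD_BUG_ON(). The union below is hypothetical: seq, write_locking and
waiters are names that appear in the patch, but read_lock, intent_lock
and all the field widths are assumptions picked so that they add up to 64
bits. Bitfields of type uint64_t are implementation-defined in ISO C,
though GCC and Clang accept them.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout for illustration; names and widths are assumptions. */
union toy_lock_state {
	uint64_t v;			/* whole word, for cmpxchg-style updates */
	struct {
		uint64_t read_lock:27;
		uint64_t write_locking:1;
		uint64_t intent_lock:1;
		uint64_t waiters:3;
		uint64_t seq:32;
	};
};

/* Fail the build if the bitfields do not pack into a single 64-bit word. */
_Static_assert(sizeof(union toy_lock_state) == sizeof(uint64_t),
	       "toy_lock_state must fit in one 64-bit word");
```

The check only pins down the total size; which bits each field occupies
within v still depends on the ABI, so code that builds values for v by
hand needs to derive them from the bitfields rather than hardcode shifts.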

> > +static __always_inline bool do_six_trylock_type(struct six_lock *lock,
> > +						enum six_lock_type type,
> > +						bool try)
> > +{
> > +	const struct six_lock_vals l[] = LOCK_VALS;
> > +	union six_lock_state old, new;
> > +	bool ret;
> > +	u64 v;
> > +
> > +	EBUG_ON(type == SIX_LOCK_write && lock->owner != current);
> > +	EBUG_ON(type == SIX_LOCK_write && (lock->state.seq & 1));
> > +
> > +	EBUG_ON(type == SIX_LOCK_write && (try != !(lock->state.write_locking)));
> > +
> > +	/*
> > +	 * Percpu reader mode:
> > +	 *
> > +	 * The basic idea behind this algorithm is that you can implement a lock
> > +	 * between two threads without any atomics, just memory barriers:
> > +	 *
> > +	 * For two threads you'll need two variables, one variable for "thread a
> > +	 * has the lock" and another for "thread b has the lock".
> > +	 *
> > +	 * To take the lock, a thread sets its variable indicating that it holds
> > +	 * the lock, then issues a full memory barrier, then reads from the
> > +	 * other thread's variable to check if the other thread thinks it has
> > +	 * the lock. If we raced, we backoff and retry/sleep.
> > +	 */
> > +
> > +	if (type == SIX_LOCK_read && lock->readers) {
> > +retry:
> > +		preempt_disable();
> > +		this_cpu_inc(*lock->readers); /* signal that we own lock */
> > +
> > +		smp_mb();
> > +
> > +		old.v = READ_ONCE(lock->state.v);
> > +		ret = !(old.v & l[type].lock_fail);
> > +
> > +		this_cpu_sub(*lock->readers, !ret);
> > +		preempt_enable();
> > +
> > +		/*
> > +		 * If we failed because a writer was trying to take the
> > +		 * lock, issue a wakeup because we might have caused a
> > +		 * spurious trylock failure:
> > +		 */
> > +		if (old.write_locking) {
> > +			struct task_struct *p = READ_ONCE(lock->owner);
> > +
> > +			if (p)
> > +				wake_up_process(p);
> > +		}
> > +
> > +		/*
> > +		 * If we failed from the lock path and the waiting bit wasn't
> > +		 * set, set it:
> > +		 */
> > +		if (!try && !ret) {
> > +			v = old.v;
> > +
> > +			do {
> > +				new.v = old.v = v;
> > +
> > +				if (!(old.v & l[type].lock_fail))
> > +					goto retry;
> > +
> > +				if (new.waiters & (1 << type))
> > +					break;
> > +
> > +				new.waiters |= 1 << type;
> > +			} while ((v = atomic64_cmpxchg(&lock->state.counter,
> > +						       old.v, new.v)) != old.v);
> > +		}
> > +	} else if (type == SIX_LOCK_write && lock->readers) {
> > +		if (try) {
> > +			atomic64_add(__SIX_VAL(write_locking, 1),
> > +				     &lock->state.counter);
> > +			smp_mb__after_atomic();
> > +		}
> > +
> > +		ret = !pcpu_read_count(lock);
> > +
> > +		/*
> > +		 * On success, we increment lock->seq; also we clear
> > +		 * write_locking unless we failed from the lock path:
> > +		 */
> > +		v = 0;
> > +		if (ret)
> > +			v += __SIX_VAL(seq, 1);
> > +		if (ret || try)
> > +			v -= __SIX_VAL(write_locking, 1);
> > +
> > +		if (try && !ret) {
> > +			old.v = atomic64_add_return(v, &lock->state.counter);
> > +			six_lock_wakeup(lock, old, SIX_LOCK_read);
> > +		} else {
> > +			atomic64_add(v, &lock->state.counter);
> > +		}
> > +	} else {
> > +		v = READ_ONCE(lock->state.v);
> > +		do {
> > +			new.v = old.v = v;
> > +
> > +			if (!(old.v & l[type].lock_fail)) {
> > +				new.v += l[type].lock_val;
> > +
> > +				if (type == SIX_LOCK_write)
> > +					new.write_locking = 0;
> > +			} else if (!try && type != SIX_LOCK_write &&
> > +				   !(new.waiters & (1 << type)))
> > +				new.waiters |= 1 << type;
> > +			else
> > +				break; /* waiting bit already set */
> > +		} while ((v = atomic64_cmpxchg_acquire(&lock->state.counter,
> > +					old.v, new.v)) != old.v);
> > +
> > +		ret = !(old.v & l[type].lock_fail);
> > +
> > +		EBUG_ON(ret && !(lock->state.v & l[type].held_mask));
> > +	}
> > +
> > +	if (ret)
> > +		six_set_owner(lock, type, old);
> > +
> > +	EBUG_ON(type == SIX_LOCK_write && (try || ret) && (lock->state.write_locking));
> > +
> > +	return ret;
> > +}
> > +
> 
> ^^^
> I'd really like to see some more comments in the code above. It's pretty
> complex.

It's already got more comments than is typical for kernel locking code :)

But if there are specific things you'd like to see clarified, please do
point them out.
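
For reference, the two-variable scheme described in the "Percpu reader
mode" comment quoted above can be sketched in user-space C11. This is
not the six locks implementation: holds[], flag_trylock() and
flag_unlock() are hypothetical names, and atomic_thread_fence() with
memory_order_seq_cst stands in for smp_mb(). Each side writes only its
own variable with a plain store, issues a full barrier, then reads the
other side's variable; no atomic read-modify-write is needed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* holds[i] != 0 means "thread i thinks it has the lock". */
static atomic_int holds[2];

/* me is 0 or 1; returns true if the lock was taken. */
static bool flag_trylock(int me)
{
	int other = 1 - me;

	/* Announce that we hold the lock. */
	atomic_store_explicit(&holds[me], 1, memory_order_relaxed);

	/* Full barrier: our store is ordered before the load below. */
	atomic_thread_fence(memory_order_seq_cst);

	if (atomic_load_explicit(&holds[other], memory_order_acquire)) {
		/* We raced with the other side: back off, caller retries. */
		atomic_store_explicit(&holds[me], 0, memory_order_relaxed);
		return false;
	}
	return true;
}

static void flag_unlock(int me)
{
	atomic_store_explicit(&holds[me], 0, memory_order_release);
}
```

Because both sides put a full barrier between their store and their
load, at least one of them is guaranteed to observe the other's store,
so both cannot succeed at once. Both can fail at the same time, which is
why the comment talks about backing off and retrying or sleeping.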
