From: ebiederm@xmission.com (Eric W. Biederman)
To: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Ingo Molnar <mingo@elte.hu>, Jan Beulich <jbeulich@novell.com>,
	tglx@linutronix.de, mingo@redhat.com, hpa@zytor.com,
	linux-kernel@vger.kernel.org, Gautham R Shenoy <ego@in.ibm.com>,
	Alexey Dobriyan <adobriyan@gmail.com>,
	netdev@vger.kernel.org
Subject: Re: [PATCH 2/2] sysctl:  lockdep support for sysctl reference counting.
Date: Tue, 31 Mar 2009 15:44:11 -0700
Message-ID: <m1hc198g90.fsf@fess.ebiederm.org>
In-Reply-To: <1238513726.8530.564.camel@twins> (Peter Zijlstra's message of "Tue\, 31 Mar 2009 17\:35\:26 +0200")

Peter Zijlstra <peterz@infradead.org> writes:

> On Tue, 2009-03-31 at 06:40 -0700, Eric W. Biederman wrote:
>> Peter Zijlstra <peterz@infradead.org> writes:
>> 
>> > On Sat, 2009-03-21 at 00:42 -0700, Eric W. Biederman wrote:
>> >> It is possible to get lock ordering deadlocks between locks
>> >> and waiting for the sysctl used count to drop to zero.  We have
>> >> recently observed one of these in the networking code.
>> >> 
>> >> So teach the sysctl code how to speak lockdep so the kernel
>> >> can warn about these kinds of rare issues proactively.
>> >
>> > It would be very good to extend this changelog with a more detailed
>> > explanation of the deadlock in question.
>> >
>> > Let me see if I got it right:
>> >
>> > We're holding a lock, while waiting for the refcount to drop to 0.
>> > Dropping that refcount is blocked on that lock.
>> >
>> > Something like that?
>> 
>> Exactly.
>> 
>> I must have written an explanation so many times that it got
>> lost when I wrote that commit message.
>> 
>> In particular the problem can be seen with /proc/sys/net/ipv4/conf/*/forwarding.
>> 
>> The problem is that the handler for forwarding takes the rtnl_lock
>> with the reference count held.
>> 
>> Then we call unregister_sysctl_table under the rtnl_lock,
>> which waits for the reference count to go to zero.
>
>> >> +
>> >> +#  define lock_sysctl() __raw_spin_lock(&sysctl_lock.raw_lock)
>> >> +#  define unlock_sysctl() __raw_spin_unlock(&sysctl_lock.raw_lock)
>> >
>> > Uhmm, Please explain that -- without a proper explanation this is a NAK.
>> 
>> If the refcount is to be considered a lock, sysctl_lock must be considered
>> the internals of that lock.  lockdep gets extremely confused otherwise.
>> Since the spinlock is static to this file I'm not especially worried
>> about it.
>
> Usually lock internal locks still get lockdep coverage. Let's see if we
> can find a way for this to be true even here. I suspect the below to
> cause the issue:
>
>> >>  /* called under sysctl_lock, will reacquire if has to wait */
>> >> @@ -1478,47 +1531,54 @@ static void start_unregistering(struct ctl_table_header *p)
>> >>  	 * if p->used is 0, nobody will ever touch that entry again;
>> >>  	 * we'll eliminate all paths to it before dropping sysctl_lock
>> >>  	 */
>> >> +	table_acquire(p);
>> >>  	if (unlikely(p->used)) {
>> >>  		struct completion wait;
>> >> +		table_contended(p);
>> >> +
>> >>  		init_completion(&wait);
>> >>  		p->unregistering = &wait;
>> >> -		spin_unlock(&sysctl_lock);
>> >> +		unlock_sysctl();
>> >>  		wait_for_completion(&wait);
>> >> -		spin_lock(&sysctl_lock);
>> >> +		lock_sysctl();
>> >>  	} else {
>> >>  		/* anything non-NULL; we'll never dereference it */
>> >>  		p->unregistering = ERR_PTR(-EINVAL);
>> >>  	}
>> >> +	table_acquired(p);
>> >> +
>> >>  	/*
>> >>  	 * do not remove from the list until nobody holds it; walking the
>> >>  	 * list in do_sysctl() relies on that.
>> >>  	 */
>> >>  	list_del_init(&p->ctl_entry);
>> >> +
>> >> +	table_release(p);
>> >>  }
>
> There you acquire the table while holding the spinlock, generating:
> sysctl_lock -> table_lock, however you then release the sysctl_lock and
> re-acquire it, generating table_lock -> sysctl_lock.
>
> Humm, can't we write that differently?

As best as I can tell, that is an artifact of sysctl_lock being
used to implement table_lock.  For the case you point out, I could
probably adjust where I claim the lock is acquired and make it
work.

__sysctl_head_next on the read side is trickier.
We come in with table_lock held for read.
We grab sysctl_lock.
We release table_lock (aka the reference count is decremented)
We grab table_lock on the next table (aka the reference count is incremented)
We release sysctl_lock

If we generate lockdep annotations for that it would seem to transition
through the states:
table_lock
table_lock -> sysctl_lock
sysctl_lock
sysctl_lock -> table_lock
table_lock

Short of saying table_lock is an implementation detail, used to
make certain operations atomic, I do not see how to model this case.
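
To illustrate why that sequence is a problem, here is a rough userspace
sketch that replays it against a toy lock-order tracker.  This is not
lockdep and not kernel code: order_tracker, tracker_acquire,
tracker_release and replay_head_next are names made up for this sketch,
and the tracker only remembers pairwise "A was held when B was taken"
edges between two classes, which is far simpler than what lockdep does.
It assumes, as in the patch, that every table shares one class.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical lock classes for this sketch; not kernel identifiers. */
enum { TABLE_LOCK, SYSCTL_LOCK, NR_CLASSES };

struct order_tracker {
	bool held[NR_CLASSES];               /* classes currently held */
	bool before[NR_CLASSES][NR_CLASSES]; /* before[a][b]: a was held when b was taken */
};

static void tracker_init(struct order_tracker *t)
{
	memset(t, 0, sizeof(*t));
}

/* Record "held -> cls" edges; return true if the reverse edge was seen before. */
static bool tracker_acquire(struct order_tracker *t, int cls)
{
	bool inversion = false;

	for (int h = 0; h < NR_CLASSES; h++) {
		if (!t->held[h] || h == cls)
			continue;
		t->before[h][cls] = true;
		if (t->before[cls][h])
			inversion = true;
	}
	t->held[cls] = true;
	return inversion;
}

static void tracker_release(struct order_tracker *t, int cls)
{
	t->held[cls] = false;
}

/* Replay the __sysctl_head_next steps; return true if an inversion is flagged. */
static bool replay_head_next(struct order_tracker *t)
{
	bool bad = false;

	bad |= tracker_acquire(t, TABLE_LOCK);  /* come in with table_lock held for read */
	bad |= tracker_acquire(t, SYSCTL_LOCK); /* records table_lock -> sysctl_lock */
	tracker_release(t, TABLE_LOCK);         /* old table's reference count dropped */
	bad |= tracker_acquire(t, TABLE_LOCK);  /* records sysctl_lock -> table_lock */
	tracker_release(t, SYSCTL_LOCK);
	return bad;
}
```

Calling replay_head_next() on a freshly initialised tracker records
both orderings, so even this naive checker complains about a sequence
that is in fact safe.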

Let me take a slightly simpler case and ask how that gets modeled.
Looking at rwsem: all of the annotations are outside of the
spin_lock.  So in some sense we are sloppy, and fib to lockdep
about when we acquire/release a lock.  In another sense
we are simply respecting the abstraction.
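
A small single-threaded userspace sketch of that idea, applied to a
reference count, might look like the following.  Everything here is
hypothetical: struct table_head, annotate, use_table and unuse_table
are stand-ins invented for illustration, the internal lock is just a
flag, and annotate() merely counts calls instead of talking to lockdep.
The point is only that the annotations sit outside the internal lock,
so a validator would never see the two nested.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a sysctl table header; not the kernel structure. */
struct table_head {
	bool internal_locked;   /* models the internal spinlock being held */
	int used;               /* the reference count */
	int annotated;          /* holders as reported to the validator */
	int nested_annotations; /* annotations issued under the internal lock */
};

static void internal_lock(struct table_head *h)
{
	assert(!h->internal_locked);
	h->internal_locked = true;
}

static void internal_unlock(struct table_head *h)
{
	assert(h->internal_locked);
	h->internal_locked = false;
}

/* Models an annotation call; counts any call made under the internal lock. */
static void annotate(struct table_head *h, int delta)
{
	if (h->internal_locked)
		h->nested_annotations++;
	h->annotated += delta;
}

static void use_table(struct table_head *h)
{
	annotate(h, +1);        /* reported before the count really changes */
	internal_lock(h);
	h->used++;
	internal_unlock(h);
}

static void unuse_table(struct table_head *h)
{
	annotate(h, -1);        /* reported before the count really drops */
	internal_lock(h);
	h->used--;
	internal_unlock(h);
}
```

After a use_table()/unuse_table() pair, nested_annotations stays at
zero: the reported acquire and release times are slightly off from the
real ones, but the internal lock never shows up as nested under the
annotated one.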

I guess I can take a look and see if I can model things in a slightly
more lossy fashion so I don't need to do the __raw_spin_lock thing.


>> >> @@ -1951,7 +2011,13 @@ struct ctl_table_header *__register_sysctl_paths(
>> >>  		return NULL;
>> >>  	}
>> >>  #endif
>> >> -	spin_lock(&sysctl_lock);
>> >> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
>> >> +	{
>> >> +		static struct lock_class_key __key;
>> >> +		lockdep_init_map(&header->dep_map, "sysctl_used", &__key, 0);
>> >> +	}
>> >> +#endif	
>> >
>> > This means every sysctl thingy gets the same class, is that
>> > intended/desired?
>> 
>> There is only one place we initialize it, and as far as I know really
>> only one place we take it.  Which is the definition of a lockdep
>> class as far as I know.
>
> Indeed, just checking.

The only difference I can possibly see is read side versus write side.
Or in my case refcount side versus wait side.

Eric


