Re: [PATCH RFC] ext4: fix potential race between online resizing and write operations

From: Uladzislau Rezki <urezki@gmail.com>
To: Joel Fernandes <joel@joelfernandes.org>
Cc: Uladzislau Rezki <urezki@gmail.com>,
	"Paul E. McKenney" <paulmck@kernel.org>,
	"Theodore Y. Ts'o" <tytso@mit.edu>,
	Ext4 Developers List <linux-ext4@vger.kernel.org>,
	Suraj Jitindar Singh <surajjs@amazon.com>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH RFC] ext4: fix potential race between online resizing and write operations
Date: Sun, 1 Mar 2020 12:08:43 +0100
Message-ID: <20200301110843.GA8725@pc636>
In-Reply-To: <20200227133700.GC161459@google.com>

>
> Sorry for slightly late reply.
> 
Same here: I have been on vacation since last Thursday and will be away
for the whole of next week, so there will be delays due to my restricted
access to email :)

> On Tue, Feb 25, 2020 at 07:54:00PM +0100, Uladzislau Rezki wrote:
> > > > > > I was thinking a 2 fold approach (just thinking out loud..):
> > > > > > 
> > > > > > If kfree_call_rcu() is called in atomic context or in any rcu reader, then
> > > > > > use GFP_ATOMIC to grow an rcu_head wrapper on the atomic memory pool and
> > > > > > queue that.
> > > > > > 
> > > > I am not sure if that is acceptable, i mean what to do when GFP_ATOMIC
> > > > gets failed in atomic context? Or we can just consider it as out of
> > > > memory and another variant is to say that headless object can be called
> > > > from preemptible context only.
> > > 
> > > Yes that makes sense, and we can always put disclaimer in the API's comments
> > > saying if this object is expected to be freed a lot, then don't use the
> > > headless-API to be extra safe.
> > > 
> > Agree.
> > 
> > > BTW, GFP_ATOMIC the documentation says if GFP_ATOMIC reserves are depleted,
> > > the kernel can even panic some times, so if GFP_ATOMIC allocation fails, then
> > > there seems to be bigger problems in the system any way. I would say let us
> > > write a patch to allocate there and see what the -mm guys think.
> > > 
> > OK. It might be that they can offer something if they do not like our
> > approach. I will try to compose something and send the patch to see.
> > The tree.c implementation is almost done, whereas tiny one is on hold.
> > 
> > I think we should support batching as well as bulk interface there.
> > Another way is to workaround head-less object, just to attach the head
> > dynamically using kmalloc() and then call_rcu() but then it will not be
> > a fair headless support :)
> > 
> > What is your view?
> 
> This kind of "head" will require backpointers to the original object as well
> right? And still wouldn't solve the "what if we run out of GFP_ATOMIC
> reserves". But let me know in a code snippet if possible about what you mean.
> 
Just to summarize: we would like to support a head-less kvfree_rcu() interface.
That means the caller passes only a bare pointer and nothing else. Therefore we
have to maintain dynamic arrays and store the pointer there, building chains of
arrays as we do for the "bulk" logic. Or, for the tiny version, just attach a
head. I prefer the first variant because it is a fair head-less implementation
and will be aligned with the tree RCU version.

If we cannot maintain the array path, i.e. under a low memory condition (for an
array we allocate one page), it makes sense to try to attach a head, so that the
object is freed the same way as we would free a regular object that contains an
rcu_head field in it:

<snip>
static inline struct rcu_head *
attach_rcu_head_to_object(void *obj, gfp_t gfp)
{
    unsigned long *ptr;

    ptr = kmalloc(sizeof(unsigned long *) +
        sizeof(struct rcu_head), gfp);
    if (!ptr)
        return NULL;

    ptr[0] = (unsigned long) obj;
    return ((struct rcu_head *) ++ptr);
}
...
void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
...
    if (head == NULL && is_vmalloc_addr((void *) func)) {
        if (!vfree_call_rcu_add_ptr_to_bulk(krcp, (void *) func)) {
            head = attach_rcu_head_to_object((void *) func, GFP_ATOMIC);
            if (head) {
                /* Set the offset and tag the headless object. */
                func = (rcu_callback_t) (sizeof(unsigned long *) + 1);

                head->func = func;
                head->next = krcp->head;
                krcp->head = head;
            }
        }
    }

Later on, when freeing the headless object to which we attached the head:

for (; head; head = next) {
...
    /* We tag the headless object, so check it. */
    if (!(((unsigned long) head - offset) & BIT(0))) {
        debug_rcu_head_unqueue(head);
    } else {
        offset -= 1;
        headless_ptr = (void *) head - offset;
    }
...
    if (!WARN_ON_ONCE(!__is_kfree_rcu_offset(offset))) {
        /*
         * Here we kvfree() the head-less object. The head was
         * attached due to a low memory condition.
         */
        if (headless_ptr)
            kvfree((void *) *headless_ptr);

        kfree((void *)head - offset);
    }
...
}
<snip>
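
To make the layout and the tagging easier to follow, here is a minimal
userspace model of the same idea. It is only a sketch under assumptions: the
struct rcu_head below is a stand-in, malloc()/free() replace kmalloc() and
kvfree()/kfree(), uintptr_t replaces unsigned long for portability, and
attach_head(), reclaim_one() and demo_headless() are hypothetical names. Unlike
the snippet above, it checks the tag on the stored offset directly.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace stand-in for the kernel's struct rcu_head (assumption). */
struct rcu_head {
    struct rcu_head *next;
    uintptr_t func;     /* models kfree_rcu() storing an offset here */
};

/* Allocate [back-pointer][rcu_head] and return a pointer to the head part. */
static struct rcu_head *attach_head(void *obj)
{
    uintptr_t *ptr = malloc(sizeof(uintptr_t) + sizeof(struct rcu_head));
    struct rcu_head *head;

    if (!ptr)
        return NULL;

    ptr[0] = (uintptr_t) obj;               /* remember the headless object */
    head = (struct rcu_head *) (ptr + 1);
    head->next = NULL;
    head->func = sizeof(uintptr_t) + 1;     /* offset to wrapper base, bit 0 is the tag */
    return head;
}

/* Free one queued head; returns 1 if it was a tagged (headless) entry. */
static int reclaim_one(struct rcu_head *head)
{
    uintptr_t offset = head->func;
    uintptr_t *base;

    if (!(offset & 1))
        return 0;       /* regular object: not modelled in this sketch */

    offset -= 1;        /* drop the tag to get the real offset */
    base = (uintptr_t *) ((char *) head - offset);
    free((void *) base[0]);     /* the headless object itself */
    free(base);                 /* the wrapper that carried the head */
    return 1;
}

/* Hypothetical driver: queue two headless objects, then reclaim them. */
static int demo_headless(void)
{
    struct rcu_head *list = NULL, *next;
    int freed = 0;

    for (int i = 0; i < 2; i++) {
        struct rcu_head *head = attach_head(malloc(64));

        assert(head);
        head->next = list;
        list = head;
    }

    for (; list; list = next) {
        next = list->next;      /* read before the wrapper is freed */
        freed += reclaim_one(list);
    }
    return freed;
}
```

The back-pointer sits directly in front of the head, so a single offset is
enough to get from the head back to both the wrapper and the original object.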

>
> And still wouldn't solve the "what if we run out of GFP_ATOMIC reserves".
>
It will not solve that corner case. But it makes sense to do anyway, because
the page allocator can say "no page, sorry" whereas the slab can still serve
our request: we need only sizeof(rcu_head) + sizeof(unsigned long *) bytes,
not a whole page.
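
The size argument can be put into numbers with a small userspace sketch. The
struct rcu_head here is a stand-in with the same two-pointer shape, and
MODEL_PAGE_SIZE (4096) is an assumption; wrapper_size(), wrappers_per_page()
and check_wrapper_fits() are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in with the same two-pointer shape as the kernel's rcu_head. */
struct rcu_head {
    struct rcu_head *next;
    void (*func)(struct rcu_head *head);
};

#define MODEL_PAGE_SIZE 4096u   /* assumed page size */

/* Bytes needed for one back-pointer plus one rcu_head. */
static size_t wrapper_size(void)
{
    return sizeof(struct rcu_head) + sizeof(unsigned long *);
}

/* How many such wrappers would fit into the space of a single page. */
static size_t wrappers_per_page(void)
{
    return MODEL_PAGE_SIZE / wrapper_size();
}

/* Sanity check: the wrapper is far smaller than a page. */
static void check_wrapper_fits(void)
{
    assert(wrapper_size() < MODEL_PAGE_SIZE);
}
```

On a typical LP64 build the wrapper comes to three pointers' worth of memory,
which is why a slab allocation has a much better chance of succeeding than a
page allocation.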

Also, when we detect a low memory condition, we should add a force flag to
schedule the "monitor work" right away:

<snip>
 if (force)
     schedule_delayed_work(&krcp->monitor_work, 0);
<snip>

> > > > > > Otherwise, grow an rcu_head on the stack of kfree_call_rcu() and call
> > > > > > synchronize_rcu() inline with it.
> > > > > > 
> > > > > >
> > > > What do you mean here, Joel? "grow an rcu_head on the stack"?
> > > 
> > > By "grow on the stack", use the compiler-allocated rcu_head on the
> > > kfree_rcu() caller's stack.
> > > 
> > > I meant here to say, if we are not in atomic context, then we use regular
> > > GFP_KERNEL allocation, and if that fails, then we just use the stack's
> > > rcu_head and call synchronize_rcu() or even synchronize_rcu_expedited since
> > > the allocation failure would mean the need for RCU to free some memory is
> > > probably great.
> > > 
> > Ah, i got it. I thought you meant something like recursion and then
> > unwinding the stack back somehow :)
> 
> Yeah something like that :) Use the compiler allocated space which you
> wouldn't run out of unless stack overflows.
> 
Hmm... Please show it here, because I am a bit confused about how to do that :)

> > > > As for "task_struct's rcu_read_lock_nesting". Will it be enough just
> > > > have a look at preempt_count of current process? If we have for example
> > > > nested rcu_read_locks:
> > > > 
> > > > <snip>
> > > > rcu_read_lock()
> > > >     rcu_read_lock()
> > > >         rcu_read_lock()
> > > > <snip>
> > > > 
> > > > the counter would be 3.
> > > 
> > > No, because preempt_count is not incremented during rcu_read_lock(). RCU
> > > reader sections can be preempted, they just cannot goto sleep in a reader
> > > section (unless the kernel is RT).
> > > 
> > So in CONFIG_PREEMPT kernel we can identify if we are in atomic or not by
> > using rcu_preempt_depth() and in_atomic(). When it comes to !CONFIG_PREEMPT
> > then we skip it and consider as atomic. Something like:
> > 
> > <snip>
> > static bool is_current_in_atomic()
> 
> Would be good to change this to is_current_in_rcu_reader() since
> rcu_preempt_depth() does not imply atomicity.
>
How about can_current_synchronize_rcu()? If we can, we just call:

<snip>
    synchronize_rcu() or synchronize_rcu_expedited();
    kvfree();
<snip>
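
Spelled out a bit more, the fallback could look like the userspace model below.
Everything in it is an assumption for illustration: the model_* variables stand
in for kernel state, synchronize_rcu() is a counting stub rather than the real
primitive, free() replaces kvfree(), and kvfree_rcu_slow_path() is a
hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Stubs standing in for kernel state and primitives (all assumptions). */
static bool model_in_reader_or_atomic;  /* true: sleeping is not allowed */
static int model_grace_periods;         /* counts stubbed grace periods */

static bool can_current_synchronize_rcu(void)
{
    return !model_in_reader_or_atomic;
}

static void synchronize_rcu(void)
{
    model_grace_periods++;      /* the real one blocks for a grace period */
}

/*
 * Last-resort path: wait for a grace period inline and free the object.
 * Returns false if the caller's context does not allow waiting, in which
 * case the pointer is left untouched.
 */
static bool kvfree_rcu_slow_path(void *ptr)
{
    if (!can_current_synchronize_rcu())
        return false;

    synchronize_rcu();
    free(ptr);                  /* kvfree() in the kernel */
    return true;
}
```

The caller keeps ownership of the pointer when false is returned, so it can
still try another scheme, for example attaching a head.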

> > {
> > #ifdef CONFIG_PREEMPT_RCU
> >     if (!rcu_preempt_depth() && !in_atomic())
> >         return false;
> 
> I think use if (!rcu_preempt_depth() && preemptible()) here.
> 
> preemptible() checks for IRQ disabled section as well.
> 
Yes, but in_atomic() does that as well. It also checks other atomic
contexts, like softirq handlers and NMI ones, so calling
synchronize_rcu() there is not allowed.

Thank you, Joel :)

--
Vlad Rezki