On Thu, 2012-11-29 at 13:29 -0500, chas williams - CONTRACTOR wrote:
> On Thu, 29 Nov 2012 18:11:48 +0000
> David Woodhouse <dwmw2@infradead.org> wrote:
> 
> > We do see the 'packet received for unknown VCC' complaint, after we
> > reboot the host without resetting the card. And as shown in the commit I
> > just referenced, we rely on the !ATM_VF_READY check in order to prevent
> > a use-after-free when the tasklet is sending packets up a VCC that's
> > just been closed.
> 
> well that behavior is just crap.  why is it delivering cells for a
> vpi/vci pair that is not open?

In the reboot case... Because it *was* open and the device has no way of
knowing that the host just rebooted. We don't reset the card on loading
the driver, because that would cause an ADSL resync.

Perhaps we could send a 'close all VCCs' command to the firmware though.
Nathan, could we add that to the firmware?

Or we could just respond to any unwanted incoming packet by sending a
close for that specific VCC. And be careful about potential races with
open().

In the close case... The *tasklet* is running to process an incoming
packet, finds an open and active VCC and is *about* to send a packet up
to it. Meanwhile, our close() runs and the VCC is destroyed. And then
the tasklet... oops, use-after-free and crash.

Hence commit 1f6ea6e51 which makes the tasklet refuse to process RX for
a VCC which doesn't have the ATM_VF_READY flag set. Because it knows
it's *being* closed. And the close() routine waits for any *existing*
tasklet run to finish, to make sure nobody's referencing the VCC, before
it returns and allows vcc_destroy_socket() to complete. It's the RCU
principle, basically.

> > You mean that (e.g.) pppoatm_push(vcc, NULL) should be waiting for any
> > pending TX skb (that has already been passed off to the driver) to
> > *complete*? How would it even do that?
> 
> i think the order of the vcc_destroy_socket() operations is a bit
> wrong.  it should call detach protocol (i.e. push a NULL).  this should
> cause the attached protocol to stop any future sends and receives, and
> it CAN sleep in this context (and only this context -- generally
> sending cannot sleep which is why this might seem confusing) to do
> whatever it needs to do to wait for the attached protocol to clean up
> queues, flush data etc.
> 
> then the driver close() should be called.  this takes care of cleaning
> up any pending tx or rx that is in the hardware.  and of course,
> close() can sleep since it will be in a interrutible context.

This is basically what Krzysztof's patch 1/7 was doing, which we've now
dropped from the series.

There are issues with *either* ordering.

The current case is that we call vcc->dev->ops->close(vcc) *first*,
before vcc->push(vcc, NULL). And that means that the device is told to
close the VCC while the protocol may still be pushing packets to it.
Hence the patches to both pppoatm and br2684 to make them check for
ATM_VF_READY and *stop* pushing packets if it's not set.

If we flip it round and tell the protocol first, then it tears down all
its data structures while the driver is still happily calling its
->pop() on transmitted skbs. Which leads to the panic in br2684_pop()
that we've *also* seen (because we weren't actually flushing the TX
packets in the driver's close(), which had the same effect). Yes, you
suggest that the protocol could keep track of the skbs it's sent down
and wait for them... but surely it's better to let the driver get called
first and *abort* them?

At this point, I think we're better off as we are (with Krzysztof's
patch 1/7 dropped, and leaving vcc->dev->ops->close() being called
before vcc->push(NULL). We've fairly much solved the issues with that
arrangement, by checking ATM_VF_READY in the protocols' ->push()
functions.

-- 
dwmw2