Re: [EXT] SRR response handling.

From: Himanshu Madhani <hmadhani@marvell.com>
To: "greg@enjellic.com" <greg@enjellic.com>, Quinn Tran <qutran@marvell.com>
Cc: "linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>
Subject: Re: [EXT] SRR response handling.
Date: Wed, 4 Sep 2019 15:03:10 +0000	[thread overview]
Message-ID: <E589D5BE-7D13-4C66-ABD3-ACA354628310@marvell.com> (raw)
In-Reply-To: <201909032038.x83Kcpiu027109@wind.enjellic.com>

Adding correct Quinn and removing qlogic.com email ID. 

It's nice to hear from you Dr Greg __ 

We will look at the request and get back to you.

Thanks,
Himanshu 

On 9/3/19, 3:39 PM, "Dr. G.W. Wettstein" <greg@wind.enjellic.com> wrote:

    External Email

    ----------------------------------------------------------------------
    Good afternoon, I hope the week is going well for everyone.

    Himanshu/Quinn, it has been a while since we spoke but I think I have
    the right group targeted for this.  I see that you have evolved from
    Qlogic to Cavium to Marvell e-mail addresses, which must indicate the
    transition of who is making the hardware... :-)

    I've been away focused on other issues but I noted with interest that
    you are now distributing the Qlogic-Target/SCST interface driver that
    we developed with your 32GBPS hardware capable driver.  I'm pleased to
    see that it must have utility with respect to providing support for
    SCST on mainstream Linux.

    I wanted to bounce an issue off your collective judgement since you
    may be the only people that know the answer to this and we may be the
    only ones that can demonstrate and/or test the issue.

    In the following commit:

    commit 2c39b5c ("qla2xxx: Remove SRR code")

    Himanshu had pulled the SRR code handling out of the target driver.
    The changelog indicates that the code was removed since no one
    appeared to be using it but it could be re-added if there was a need
    for it.

    While SRR is ostensibly in support of tape, a block target initiator
    on an FCOE fabric will issue SRR's in response to frame drops on the
    underlying ethernet fabric.  In the following commit:

    commit 6f58c78 ("qla2xxx: Fix kernel panic on selective retransmission request")

    We fixed a bug that caused the Qlogic target driver to immediately
    panic the kernel if an SRR was received and debugging was enabled.  We
    found this bug secondary to debugging an issue with the fact that the
    Clipper ASIC's on Cisco Nexus 7K switches are not configured by
    default to support lossless fibre-channel frame transport.

    We have been using a version of the target driver from before when the
    SRR code was removed that has been performing flawlessly which is why
    we haven't spent time monkeying with upgrades.

    Secondary to a recent firmware upgrade on the Nexus switches that we
    conducted, we ended up in a situation where one of the ethernet
    interfaces in a port-channel began dropping frames at high rates.
    This caused one of our target servers, running a 4GBPS 2462 HBA in
    target mode, to immediately panic on what appears to be the BUG_ON
    assertion at the start of the scst_tgt_cmd_done() function.

    We were going to run this down when we noticed that you had officially
    married the Qlogic target driver with the SCST interface driver.  So
    we rolled a 4.9.190 kernel with the driver out of the SCST trunk to
    see how it handled all of this since we would prefer to be on
    something that you guys are directly and actively maintaining.

    We tested the resultant kernel and target driver and it appears to be
    functioning fine.  After thinking about things for a bit we elected to
    not expose the target server to the initiators on the other side of
    the port-channel which has the interface with the high frame discard
    rate.  The rationale for this is that we are unsure of what the
    behavior is going to be when a block initiator issues an SRR against a
    target server that has had the SRR handling code removed.

    We currently have three target servers that are handling thousands of
    SRR's from these initiators on a day in and day out basis with no ill
    effects, with the old driver that has SRR handling support.  One of
    the local patches we carry has debug code that prints out which
    initiator has issued an SRR.  So we know that the SRR handling code is
    being invoked and appears to be coping with the situation just fine,
    which is what is giving us pause with respect to exposing the new
    target server to those initiators.

    If the initiators throw an error that is well and good but we are
    obviously concerned about possible silent failures and/or corruption.

    Since all of this is so esoteric and uncommon we wanted to bounce this
    off the experts to get your sense on what we are dealing with.
    Obviously if we can help debug and/or improve the driver or provide a
    rationale for folding the SRR code back into the driver we would be
    more then happy to contribute and test.

    Let me know your thoughts and we will go from there.

    Have a good day.

    Dr. Greg

    As always,
    Dr. G.W. Wettstein, Ph.D.   Enjellic Systems Development, LLC.
    4206 N. 19th Ave.           Specializing in information infra-structure
    Fargo, ND  58102            development.
    PH: 701-281-1686            EMAIL: greg@enjellic.com
    ------------------------------------------------------------------------------
    "If you plugged up your nose and mouth right before you sneezed, would
     the sneeze go out your ears or would your head explode?  Either way I'm
     afraid to try."
                                    -- Nick Kean

    --