linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] scsi: fcoe: memleak fixes
@ 2018-07-31 13:46 Johannes Thumshirn
  2018-07-31 13:46 ` [PATCH 1/3] fcoe: fix use-after-free in fcoe_ctlr_els_send Johannes Thumshirn
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Johannes Thumshirn @ 2018-07-31 13:46 UTC (permalink / raw)
  To: Martin K . Petersen
  Cc: Linux Kernel Mailinglist, Linux SCSI Mailinglist, ard,
	Johannes Thumshirn

Ard reported a memory leak in FCoE at [1] and this patch set contains
the fix for this leak, a second leak I discovered while hunting the
first leak and a use-after-free in FCoE's debugging.

[1] https://marc.info/?l=linux-scsi&m=153261165228133&w=2

Johannes Thumshirn (3):
  fcoe: fix use-after-free in fcoe_ctlr_els_send
  scsi: fcoe: drop frames in ELS LOGO error path
  scsi: fcoe: clear FC_RP_STARTED flags when receiving a LOGO

 drivers/scsi/fcoe/fcoe_ctlr.c | 6 +++---
 drivers/scsi/libfc/fc_rport.c | 1 +
 2 files changed, 4 insertions(+), 3 deletions(-)

-- 
2.16.4


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/3] fcoe: fix use-after-free in fcoe_ctlr_els_send
  2018-07-31 13:46 [PATCH 0/3] scsi: fcoe: memleak fixes Johannes Thumshirn
@ 2018-07-31 13:46 ` Johannes Thumshirn
  2018-08-01  6:30   ` Hannes Reinecke
  2018-07-31 13:46 ` [PATCH 2/3] scsi: fcoe: drop frames in ELS LOGO error path Johannes Thumshirn
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 19+ messages in thread
From: Johannes Thumshirn @ 2018-07-31 13:46 UTC (permalink / raw)
  To: Martin K . Petersen
  Cc: Linux Kernel Mailinglist, Linux SCSI Mailinglist, ard,
	Johannes Thumshirn

KASAN reports a use-after-free in fcoe_ctlr_els_send() when we're
sending a LOGO and have FIP debugging enabled. This is because we're
first freeing the skb and then printing the frame's DID. But the DID
is a member of the FC frame header which in turn is the skb's payload.

Exchange the debug print and kfree_skb() calls so we're not touching
the freed data.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
---
 drivers/scsi/fcoe/fcoe_ctlr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/fcoe/fcoe_ctlr.c b/drivers/scsi/fcoe/fcoe_ctlr.c
index ea23c8dffc25..ceb35ebbeb8f 100644
--- a/drivers/scsi/fcoe/fcoe_ctlr.c
+++ b/drivers/scsi/fcoe/fcoe_ctlr.c
@@ -799,9 +799,9 @@ int fcoe_ctlr_els_send(struct fcoe_ctlr *fip, struct fc_lport *lport,
 	fip->send(fip, skb);
 	return -EINPROGRESS;
 drop:
-	kfree_skb(skb);
 	LIBFCOE_FIP_DBG(fip, "drop els_send op %u d_id %x\n",
 			op, ntoh24(fh->fh_d_id));
+	kfree_skb(skb);
 	return -EINVAL;
 }
 EXPORT_SYMBOL(fcoe_ctlr_els_send);
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 2/3] scsi: fcoe: drop frames in ELS LOGO error path
  2018-07-31 13:46 [PATCH 0/3] scsi: fcoe: memleak fixes Johannes Thumshirn
  2018-07-31 13:46 ` [PATCH 1/3] fcoe: fix use-after-free in fcoe_ctlr_els_send Johannes Thumshirn
@ 2018-07-31 13:46 ` Johannes Thumshirn
  2018-08-01  6:30   ` Hannes Reinecke
  2018-07-31 13:46 ` [PATCH 3/3] scsi: fcoe: clear FC_RP_STARTED flags when receiving a LOGO Johannes Thumshirn
  2018-08-06  9:25 ` [PATCH 0/3] scsi: fcoe: memleak fixes Johannes Thumshirn
  3 siblings, 1 reply; 19+ messages in thread
From: Johannes Thumshirn @ 2018-07-31 13:46 UTC (permalink / raw)
  To: Martin K . Petersen
  Cc: Linux Kernel Mailinglist, Linux SCSI Mailinglist, ard,
	Johannes Thumshirn

Drop the frames in the ELS LOGO error path instead of just returning
an error.

This fixes the following kmemleak report:
unreferenced object 0xffff880064cb1000 (size 424):
  comm "kworker/0:2", pid 24, jiffies 4294904293 (age 68.504s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<(____ptrval____)>] _fc_frame_alloc+0x2c/0x180 [libfc]
    [<(____ptrval____)>] fc_lport_enter_logo+0x106/0x360 [libfc]
    [<(____ptrval____)>] fc_fabric_logoff+0x8c/0xc0 [libfc]
    [<(____ptrval____)>] fcoe_if_destroy+0x79/0x3b0 [fcoe]
    [<(____ptrval____)>] fcoe_destroy_work+0xd2/0x170 [fcoe]
    [<(____ptrval____)>] process_one_work+0x7ff/0x1420
    [<(____ptrval____)>] worker_thread+0x87/0xef0
    [<(____ptrval____)>] kthread+0x2db/0x390
    [<(____ptrval____)>] ret_from_fork+0x35/0x40
    [<(____ptrval____)>] 0xffffffffffffffff

which can be triggered by issuing
echo eth0 > /sys/bus/fcoe/ctlr_destroy

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
---
 drivers/scsi/fcoe/fcoe_ctlr.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/fcoe/fcoe_ctlr.c b/drivers/scsi/fcoe/fcoe_ctlr.c
index ceb35ebbeb8f..ffec695e0bfb 100644
--- a/drivers/scsi/fcoe/fcoe_ctlr.c
+++ b/drivers/scsi/fcoe/fcoe_ctlr.c
@@ -754,9 +754,9 @@ int fcoe_ctlr_els_send(struct fcoe_ctlr *fip, struct fc_lport *lport,
 	case ELS_LOGO:
 		if (fip->mode == FIP_MODE_VN2VN) {
 			if (fip->state != FIP_ST_VNMP_UP)
-				return -EINVAL;
+				goto drop;
 			if (ntoh24(fh->fh_d_id) == FC_FID_FLOGI)
-				return -EINVAL;
+				goto drop;
 		} else {
 			if (fip->state != FIP_ST_ENABLED)
 				return 0;
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 3/3] scsi: fcoe: clear FC_RP_STARTED flags when receiving a LOGO
  2018-07-31 13:46 [PATCH 0/3] scsi: fcoe: memleak fixes Johannes Thumshirn
  2018-07-31 13:46 ` [PATCH 1/3] fcoe: fix use-after-free in fcoe_ctlr_els_send Johannes Thumshirn
  2018-07-31 13:46 ` [PATCH 2/3] scsi: fcoe: drop frames in ELS LOGO error path Johannes Thumshirn
@ 2018-07-31 13:46 ` Johannes Thumshirn
  2018-08-01  6:31   ` Hannes Reinecke
  2018-08-06  9:25 ` [PATCH 0/3] scsi: fcoe: memleak fixes Johannes Thumshirn
  3 siblings, 1 reply; 19+ messages in thread
From: Johannes Thumshirn @ 2018-07-31 13:46 UTC (permalink / raw)
  To: Martin K . Petersen
  Cc: Linux Kernel Mailinglist, Linux SCSI Mailinglist, ard,
	Johannes Thumshirn

When receiving a LOGO request we forget to clear the FC_RP_STARTED
flag before starting the rport delete routine.

As the started flag was not cleared, we're not deleting the rport but
waiting for a restart and thus are keeping the reference count of the
rdata object at 1.

This leads to the following kmemleak report:
unreferenced object 0xffff88006542aa00 (size 512):
  comm "kworker/0:2", pid 24, jiffies 4294899222 (age 226.880s)
  hex dump (first 32 bytes):
    68 96 fe 65 00 88 ff ff 00 00 00 00 00 00 00 00  h..e............
    01 00 00 00 08 00 00 00 02 c5 45 24 ac b8 00 10  ..........E$....
  backtrace:
    [<(____ptrval____)>] fcoe_ctlr_vn_add.isra.5+0x7f/0x770 [libfcoe]
    [<(____ptrval____)>] fcoe_ctlr_vn_recv+0x12af/0x27f0 [libfcoe]
    [<(____ptrval____)>] fcoe_ctlr_recv_work+0xd01/0x32f0 [libfcoe]
    [<(____ptrval____)>] process_one_work+0x7ff/0x1420
    [<(____ptrval____)>] worker_thread+0x87/0xef0
    [<(____ptrval____)>] kthread+0x2db/0x390
    [<(____ptrval____)>] ret_from_fork+0x35/0x40
    [<(____ptrval____)>] 0xffffffffffffffff

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reported-by: ard <ard@kwaak.net>
---
 drivers/scsi/libfc/fc_rport.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/scsi/libfc/fc_rport.c b/drivers/scsi/libfc/fc_rport.c
index 31d31aad3de1..89b1f1af2fd4 100644
--- a/drivers/scsi/libfc/fc_rport.c
+++ b/drivers/scsi/libfc/fc_rport.c
@@ -2164,6 +2164,7 @@ static void fc_rport_recv_logo_req(struct fc_lport *lport, struct fc_frame *fp)
 		FC_RPORT_DBG(rdata, "Received LOGO request while in state %s\n",
 			     fc_rport_state(rdata));
 
+		rdata->flags &= ~FC_RP_STARTED;
 		fc_rport_enter_delete(rdata, RPORT_EV_STOP);
 		mutex_unlock(&rdata->rp_mutex);
 		kref_put(&rdata->kref, fc_rport_destroy);
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/3] fcoe: fix use-after-free in fcoe_ctlr_els_send
  2018-07-31 13:46 ` [PATCH 1/3] fcoe: fix use-after-free in fcoe_ctlr_els_send Johannes Thumshirn
@ 2018-08-01  6:30   ` Hannes Reinecke
  0 siblings, 0 replies; 19+ messages in thread
From: Hannes Reinecke @ 2018-08-01  6:30 UTC (permalink / raw)
  To: Johannes Thumshirn, Martin K . Petersen
  Cc: Linux Kernel Mailinglist, Linux SCSI Mailinglist, ard

On 07/31/2018 03:46 PM, Johannes Thumshirn wrote:
> KASAN reports a use-after-free in fcoe_ctlr_els_send() when we're
> sending a LOGO and have FIP debugging enabled. This is because we're
> first freeing the skb and then printing the frame's DID. But the DID
> is a member of the FC frame header which in turn is the skb's payload.
> 
> Exchange the debug print and kfree_skb() calls so we're not touching
> the freed data.
> 
> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
> ---
>  drivers/scsi/fcoe/fcoe_ctlr.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/fcoe/fcoe_ctlr.c b/drivers/scsi/fcoe/fcoe_ctlr.c
> index ea23c8dffc25..ceb35ebbeb8f 100644
> --- a/drivers/scsi/fcoe/fcoe_ctlr.c
> +++ b/drivers/scsi/fcoe/fcoe_ctlr.c
> @@ -799,9 +799,9 @@ int fcoe_ctlr_els_send(struct fcoe_ctlr *fip, struct fc_lport *lport,
>  	fip->send(fip, skb);
>  	return -EINPROGRESS;
>  drop:
> -	kfree_skb(skb);
>  	LIBFCOE_FIP_DBG(fip, "drop els_send op %u d_id %x\n",
>  			op, ntoh24(fh->fh_d_id));
> +	kfree_skb(skb);
>  	return -EINVAL;
>  }
>  EXPORT_SYMBOL(fcoe_ctlr_els_send);
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/3] scsi: fcoe: drop frames in ELS LOGO error path
  2018-07-31 13:46 ` [PATCH 2/3] scsi: fcoe: drop frames in ELS LOGO error path Johannes Thumshirn
@ 2018-08-01  6:30   ` Hannes Reinecke
  0 siblings, 0 replies; 19+ messages in thread
From: Hannes Reinecke @ 2018-08-01  6:30 UTC (permalink / raw)
  To: Johannes Thumshirn, Martin K . Petersen
  Cc: Linux Kernel Mailinglist, Linux SCSI Mailinglist, ard

On 07/31/2018 03:46 PM, Johannes Thumshirn wrote:
> Drop the frames in the ELS LOGO error path instead of just returning
> an error.
> 
> This fixes the following kmemleak report:
> unreferenced object 0xffff880064cb1000 (size 424):
>   comm "kworker/0:2", pid 24, jiffies 4294904293 (age 68.504s)
>   hex dump (first 32 bytes):
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>   backtrace:
>     [<(____ptrval____)>] _fc_frame_alloc+0x2c/0x180 [libfc]
>     [<(____ptrval____)>] fc_lport_enter_logo+0x106/0x360 [libfc]
>     [<(____ptrval____)>] fc_fabric_logoff+0x8c/0xc0 [libfc]
>     [<(____ptrval____)>] fcoe_if_destroy+0x79/0x3b0 [fcoe]
>     [<(____ptrval____)>] fcoe_destroy_work+0xd2/0x170 [fcoe]
>     [<(____ptrval____)>] process_one_work+0x7ff/0x1420
>     [<(____ptrval____)>] worker_thread+0x87/0xef0
>     [<(____ptrval____)>] kthread+0x2db/0x390
>     [<(____ptrval____)>] ret_from_fork+0x35/0x40
>     [<(____ptrval____)>] 0xffffffffffffffff
> 
> which can be triggered by issuing
> echo eth0 > /sys/bus/fcoe/ctlr_destroy
> 
> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
> ---
>  drivers/scsi/fcoe/fcoe_ctlr.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/scsi/fcoe/fcoe_ctlr.c b/drivers/scsi/fcoe/fcoe_ctlr.c
> index ceb35ebbeb8f..ffec695e0bfb 100644
> --- a/drivers/scsi/fcoe/fcoe_ctlr.c
> +++ b/drivers/scsi/fcoe/fcoe_ctlr.c
> @@ -754,9 +754,9 @@ int fcoe_ctlr_els_send(struct fcoe_ctlr *fip, struct fc_lport *lport,
>  	case ELS_LOGO:
>  		if (fip->mode == FIP_MODE_VN2VN) {
>  			if (fip->state != FIP_ST_VNMP_UP)
> -				return -EINVAL;
> +				goto drop;
>  			if (ntoh24(fh->fh_d_id) == FC_FID_FLOGI)
> -				return -EINVAL;
> +				goto drop;
>  		} else {
>  			if (fip->state != FIP_ST_ENABLED)
>  				return 0;
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] scsi: fcoe: clear FC_RP_STARTED flags when receiving a LOGO
  2018-07-31 13:46 ` [PATCH 3/3] scsi: fcoe: clear FC_RP_STARTED flags when receiving a LOGO Johannes Thumshirn
@ 2018-08-01  6:31   ` Hannes Reinecke
  0 siblings, 0 replies; 19+ messages in thread
From: Hannes Reinecke @ 2018-08-01  6:31 UTC (permalink / raw)
  To: Johannes Thumshirn, Martin K . Petersen
  Cc: Linux Kernel Mailinglist, Linux SCSI Mailinglist, ard

On 07/31/2018 03:46 PM, Johannes Thumshirn wrote:
> When receiving a LOGO request we forget to clear the FC_RP_STARTED
> flag before starting the rport delete routine.
> 
> As the started flag was not cleared, we're not deleting the rport but
> waiting for a restart and thus are keeping the reference count of the
> rdata object at 1.
> 
> This leads to the following kmemleak report:
> unreferenced object 0xffff88006542aa00 (size 512):
>   comm "kworker/0:2", pid 24, jiffies 4294899222 (age 226.880s)
>   hex dump (first 32 bytes):
>     68 96 fe 65 00 88 ff ff 00 00 00 00 00 00 00 00  h..e............
>     01 00 00 00 08 00 00 00 02 c5 45 24 ac b8 00 10  ..........E$....
>   backtrace:
>     [<(____ptrval____)>] fcoe_ctlr_vn_add.isra.5+0x7f/0x770 [libfcoe]
>     [<(____ptrval____)>] fcoe_ctlr_vn_recv+0x12af/0x27f0 [libfcoe]
>     [<(____ptrval____)>] fcoe_ctlr_recv_work+0xd01/0x32f0 [libfcoe]
>     [<(____ptrval____)>] process_one_work+0x7ff/0x1420
>     [<(____ptrval____)>] worker_thread+0x87/0xef0
>     [<(____ptrval____)>] kthread+0x2db/0x390
>     [<(____ptrval____)>] ret_from_fork+0x35/0x40
>     [<(____ptrval____)>] 0xffffffffffffffff
> 
> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
> Reported-by: ard <ard@kwaak.net>
> ---
>  drivers/scsi/libfc/fc_rport.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/scsi/libfc/fc_rport.c b/drivers/scsi/libfc/fc_rport.c
> index 31d31aad3de1..89b1f1af2fd4 100644
> --- a/drivers/scsi/libfc/fc_rport.c
> +++ b/drivers/scsi/libfc/fc_rport.c
> @@ -2164,6 +2164,7 @@ static void fc_rport_recv_logo_req(struct fc_lport *lport, struct fc_frame *fp)
>  		FC_RPORT_DBG(rdata, "Received LOGO request while in state %s\n",
>  			     fc_rport_state(rdata));
>  
> +		rdata->flags &= ~FC_RP_STARTED;
>  		fc_rport_enter_delete(rdata, RPORT_EV_STOP);
>  		mutex_unlock(&rdata->rp_mutex);
>  		kref_put(&rdata->kref, fc_rport_destroy);
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] scsi: fcoe: memleak fixes
  2018-07-31 13:46 [PATCH 0/3] scsi: fcoe: memleak fixes Johannes Thumshirn
                   ` (2 preceding siblings ...)
  2018-07-31 13:46 ` [PATCH 3/3] scsi: fcoe: clear FC_RP_STARTED flags when receiving a LOGO Johannes Thumshirn
@ 2018-08-06  9:25 ` Johannes Thumshirn
  2018-08-06 13:22   ` ard
  2018-08-09  8:05   ` Johannes Thumshirn
  3 siblings, 2 replies; 19+ messages in thread
From: Johannes Thumshirn @ 2018-08-06  9:25 UTC (permalink / raw)
  To: Martin K . Petersen; +Cc: Linux Kernel Mailinglist, Linux SCSI Mailinglist, ard

Ard,

Did you have time to test these patches?

Byte,
	Johannes
-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] scsi: fcoe: memleak fixes
  2018-08-06  9:25 ` [PATCH 0/3] scsi: fcoe: memleak fixes Johannes Thumshirn
@ 2018-08-06 13:22   ` ard
  2018-08-06 13:27     ` Johannes Thumshirn
  2018-08-09  8:05   ` Johannes Thumshirn
  1 sibling, 1 reply; 19+ messages in thread
From: ard @ 2018-08-06 13:22 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Martin K . Petersen, Linux Kernel Mailinglist, Linux SCSI Mailinglist

Hi,

Johannes,

On Mon, Aug 06, 2018 at 11:25:03AM +0200, Johannes Thumshirn wrote:
> Ard,
> 
> Did you have time to test these patches?

Ahh, yeah, sorry.
I did test them.
I can't say I see a regression.
I also can't say I see progress, although it feels like there is
a tad less leaking.
For that I need to do some statistics.

But for now I ack that it does not regress further.

-- 
.signature not found

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] scsi: fcoe: memleak fixes
  2018-08-06 13:22   ` ard
@ 2018-08-06 13:27     ` Johannes Thumshirn
  2018-08-06 14:24       ` ard
  0 siblings, 1 reply; 19+ messages in thread
From: Johannes Thumshirn @ 2018-08-06 13:27 UTC (permalink / raw)
  To: ard; +Cc: Martin K . Petersen, Linux Kernel Mailinglist, Linux SCSI Mailinglist

On Mon, Aug 06, 2018 at 03:22:34PM +0200, ard wrote:
> Ahh, yeah, sorry.
> I did test them.
> I can't say I see a regression.
> I also can't say I see progress, although it feels like there is
> a tad less leaking.
> For that I need to do some statistics.
> 
> But for now I ack that it does not regress further.

From what I've seen in your kmemleak reports, it wasn't only
libfc/fcoe that was leaking memory.

So kmemleak reports are welcome (especially if they confirm libfc/fcoe
is clean now ;-)).

@Martin can you stage these patches for 4.19?

-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] scsi: fcoe: memleak fixes
  2018-08-06 13:27     ` Johannes Thumshirn
@ 2018-08-06 14:24       ` ard
  2018-08-07  6:54         ` Johannes Thumshirn
  0 siblings, 1 reply; 19+ messages in thread
From: ard @ 2018-08-06 14:24 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Martin K . Petersen, Linux Kernel Mailinglist, Linux SCSI Mailinglist

Hi Johannes,

On Mon, Aug 06, 2018 at 03:27:25PM +0200, Johannes Thumshirn wrote:
> On Mon, Aug 06, 2018 at 03:22:34PM +0200, ard wrote:
> > Ahh, yeah, sorry.
> > I did test them.
> > I can't say I see a regression.
> > I also can't say I see progress, although it feels like there is
> > a tad less leaking.
> > For that I need to do some statistics.
> > 
> > But for now I ack that it does not regress further.
> 
> From what I've seen in your kmemleak reports, it wasn't only
> libfc/fcoe that was leaking memory.
> 
> So kmemleak reports are welcome (especially if they confirm libfc/fcoe
> is clean now ;-)).
I've updated the ticket:
https://github.com/hardkernel/linux/issues/360#issuecomment-410721997
dmesg (with full debugging):
https://github.com/hardkernel/linux/files/2262858/dmesg-2018-08-06.txt
and kmemleak:
https://github.com/hardkernel/linux/files/2262861/kmemleak-2018-08-06.txt

I'll have to take a good look at the what and why.
I'm gonna git clean -x -d and rebuild the kernel, because this is
weird.

This is the tree I build:
https://github.com/ardje/linux/tree/gb-4.14-fcoe-patch
So vanilla linux-4.14 with the patch set:
https://github.com/ardje/linux/commit/747bf04057a99be5b01f768654cfd61bc9f4fc6c
Meh, I should try to use git for patching.

I will spam some printk's later on, just to check which flows are
followed (if the git clean doesn't reveal some problem, although
I can't imagine).

Regards,
Ard

-- 
.signature not found

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] scsi: fcoe: memleak fixes
  2018-08-06 14:24       ` ard
@ 2018-08-07  6:54         ` Johannes Thumshirn
  2018-08-07  9:26           ` ard
  0 siblings, 1 reply; 19+ messages in thread
From: Johannes Thumshirn @ 2018-08-07  6:54 UTC (permalink / raw)
  To: ard; +Cc: Martin K . Petersen, Linux Kernel Mailinglist, Linux SCSI Mailinglist

On Mon, Aug 06, 2018 at 04:24:14PM +0200, ard wrote:
> I've updated the ticket:
> https://github.com/hardkernel/linux/issues/360#issuecomment-410721997
> dmesg (with full debugging):
> https://github.com/hardkernel/linux/files/2262858/dmesg-2018-08-06.txt
> and kmemleak:
> https://github.com/hardkernel/linux/files/2262861/kmemleak-2018-08-06.txt
> 
> I'll have to take a good look at the what and why.
> I'm gonna git clean -x -d and rebuild the kernel, because this is
> weird.
> 
> This is the tree I build:
> https://github.com/ardje/linux/tree/gb-4.14-fcoe-patch
> So vanilla linux-4.14 with the patch set:
> https://github.com/ardje/linux/commit/747bf04057a99be5b01f768654cfd61bc9f4fc6c
> Meh, I should try to use git for patching.
> 
> I will spam some printk's later on, just to check which flows are
> followed (if the git clean doesn't reveal some problem, although
> I can't imagine).

OK, now this is wired. Are you seeing this on the initiator or on the
target side? Also on x86_64 or just the odroid? I could reproduce your
reports in my Virtualized Environment [1][2] by issuing deletes from the
initiator side..

[1] https://github.com/rapido-linux/rapido
[2] https://github.com/rapido-linux/rapido/blob/master/fcoe_local_autorun.sh

Byte,
	Johannes
-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] scsi: fcoe: memleak fixes
  2018-08-07  6:54         ` Johannes Thumshirn
@ 2018-08-07  9:26           ` ard
  2018-08-07  9:57             ` ard
  2018-08-07 16:04             ` ard
  0 siblings, 2 replies; 19+ messages in thread
From: ard @ 2018-08-07  9:26 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Martin K . Petersen, Linux Kernel Mailinglist, Linux SCSI Mailinglist

Hi,

On Tue, Aug 07, 2018 at 08:54:00AM +0200, Johannes Thumshirn wrote:
> OK, now this is wired. Are you seeing this on the initiator or on the
> target side? Also on x86_64 or just the odroid? I could reproduce your
> reports in my Virtualized Environment [1][2] by issuing deletes from the
> initiator side..
Yes it is weird, and it is even more weird when I looked at the
collectd statistics:
The memory leak was almost none existent on my test odroid with
the PC turned off. When I turn it back on, it rises to 150MB/day
So it seems you need at least some party.
The most important thing to realise: this is pure vn2vn chatter.
There is no traffic going from or to the test odroid (to the test
pc there is some).
If I disable the FCoE vlan on the switch port, the chatter *and*
the memory leaks vanishes.
Meeh, this reports needs a better place than just e-mail, I got a
few nice graphs to show.

But here is an overview of my FCoE vlan:
(Sorted by hand)
(GS724Tv4) #show mac-addr-table vlan 11

Address Entries Currently in Use............... 89

   MAC Address     Interface     Status
-----------------  ---------  ------------
00:1E:06:30:05:50  g4         odroid4 Xu4/exynos 5422/4.4.0-rc6    stable (330 days up)
0E:FD:00:00:05:50  g4         Learned
00:1E:06:30:04:E0  g6         odroid6 Xu4/exynos 5422/4.9.28       stable (330 days up)
0E:FD:00:00:04:E0  g6         Learned
00:1E:06:30:05:52  g7         odroid7 Xu4/exynos 5422/4.14.55      leaking (150MB leak/day)
0E:FD:00:00:05:52  g7         Learned
00:0E:0C:B0:68:37  g14        storage SS4000E/Xscale 80219/3.7.1   stable (295 days up)
0E:FD:00:00:68:37  g14        Learned
00:14:FD:16:DD:50  g15        thecus1 n4200eco/D525/4.3.0          stable (295 days up)
0E:FD:00:00:DD:50  g15        Learned
00:24:1D:7F:40:88  g17        antec   PC/i7-920/4.14.59            leaking
0E:FD:00:00:40:88  g17        Learned


The system on G14 and G15 are both long time targets.
G4,6 and 7 (my production server is on 5 with FCoE and kmemleak, but with the
FCoE vlan removed) are odroids doing nothing more with FCoE but being there.
(Waiting for experiments for bcache on eMMC, I used to be able to
crash the FCoE *target* using btrfs on bcache on eMMC and FCoE.
(Target was running 4.0.0 back then).
Generic config (PC and odroid):
root@odroid6:~# cat /etc/network/interfaces.d/20-fcoe 
auto fcoe
iface fcoe inet manual
        pre-up modprobe fcoe || true
        pre-up ip link add link eth0 name fcoe type vlan id 11
        pre-up sysctl -w net.ipv6.conf.fcoe.disable_ipv6=1
        up ip link set up dev fcoe
        up sh -c 'echo fcoe > /sys/module/libfcoe/parameters/create_vn2vn'
        #up /root/mountfcoe
        #pre-down /root/stop-bcaches
        pre-down sh -c 'echo fcoe > /sys/module/libfcoe/parameters/destroy'
        down ip link set down dev fcoe
        down ip link del fcoe           

The targets are configured with some version of targetcli (so a
big echo shell script).

This is on the 4.14 systems:
root@antec:~# grep .  /sys/class/fc_*/*/port_*
/sys/class/fc_host/host10/port_id:0x004088
/sys/class/fc_host/host10/port_name:0x200000241d7f4088
/sys/class/fc_host/host10/port_state:Online
/sys/class/fc_host/host10/port_type:NPort (fabric via point-to-point)
/sys/class/fc_remote_ports/rport-10:0-0/port_id:0x00dd50
/sys/class/fc_remote_ports/rport-10:0-0/port_name:0x20000014fd16dd50
/sys/class/fc_remote_ports/rport-10:0-0/port_state:Online
/sys/class/fc_remote_ports/rport-10:0-1/port_id:0x006837
/sys/class/fc_remote_ports/rport-10:0-1/port_name:0x2000000e0cb06837
/sys/class/fc_remote_ports/rport-10:0-1/port_state:Online
/sys/class/fc_remote_ports/rport-10:0-2/port_id:0x000550
/sys/class/fc_remote_ports/rport-10:0-2/port_name:0x2000001e06300550
/sys/class/fc_remote_ports/rport-10:0-2/port_state:Online
/sys/class/fc_remote_ports/rport-10:0-3/port_id:0x0004e0
/sys/class/fc_remote_ports/rport-10:0-3/port_name:0x2000001e063004e0
/sys/class/fc_remote_ports/rport-10:0-3/port_state:Online
/sys/class/fc_transport/target10:0:0/port_id:0x00dd50
/sys/class/fc_transport/target10:0:0/port_name:0x20000014fd16dd50

None of the other systems have an fc_transport, as they do not
have targets assigned to them (currently).
Notice that antec (PC) does not see odroid7.
The same is true vice versa.
All other systems see both antec and odroid7.

So they all can see eachother except for the 4.14 systems that
can't see eachother.

Now when I noticed that it only happened when my PC starts, I
wondered why it also happened when my PC is turned off, as I turn
it on once every few months and sometimes in the winter, it's
power usage is the same as the remaining systems combined.

And my next thing is: why did my production server seemed to die
less fast since a few kernel upgrades (in the 4.14 line).
I got it figured out now:
Before the heatwave, I had odroid5 turned on, my steam machine
(also with FCoE as an active initiator and 4.14 kernel) and the
PC turned off.  So that still makes 6 FCoE ports on the network.
When the summer came I needed to turn off the steam machine as
much as possible. This resulted in my main production server only
needing a reboot once ever week instead of every 2 days. I
attributed that to kernel fixes (as I knew there was a memory
leak, just didn't know where yet).

Thinking about that some more: do I need 4.14 systems to trigger
a bug within eachother, or is it pure the number of fc hosts
that should be bigger than 5 to trigger a bug in 4.14?

So a conclusion of my rambling:
1) you either need 6 vn2vn hosts *or* you need more than one 4.14
kernel in a network to trigger. One of the two. I need to think
about this. The fact that the 4.14 systems can't see eachother is
an indicator.  I can turn off the FCoE on some other system to
see if the memleak stops.
2) kernels up to 4.9.28 do not have a memoryleak. 4.14.28+ do
have the memory leak.
3) I need a place for graphs, I will see if I can abuse the
github ticket some more 8-D.
4) Just having FCoE enabled on an interface and
*receiving*/interacting with FCoE vn2vn chatter triggers the bug.
So that's only setting up the rports, maintaining ownership of
your port id.
5) The memleak itself is architecture independent and NIC
independent.


-- 
.signature not found

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] scsi: fcoe: memleak fixes
  2018-08-07  9:26           ` ard
@ 2018-08-07  9:57             ` ard
  2018-08-07 16:04             ` ard
  1 sibling, 0 replies; 19+ messages in thread
From: ard @ 2018-08-07  9:57 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Martin K . Petersen, Linux Kernel Mailinglist, Linux SCSI Mailinglist

On Tue, Aug 07, 2018 at 11:26:19AM +0200, ard wrote:
> 3) I need a place for graphs, I will see if I can abuse the
> github ticket some more 8-D.
https://github.com/hardkernel/linux/issues/360#issuecomment-411001362

Regards,
Ard van Breemen
-- 
.signature not found

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] scsi: fcoe: memleak fixes
  2018-08-07  9:26           ` ard
  2018-08-07  9:57             ` ard
@ 2018-08-07 16:04             ` ard
  2018-08-09 10:01               ` ard
  1 sibling, 1 reply; 19+ messages in thread
From: ard @ 2018-08-07 16:04 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Martin K . Petersen, Linux Kernel Mailinglist, Linux SCSI Mailinglist

Hi,

On Tue, Aug 07, 2018 at 11:26:19AM +0200, ard wrote:
> So a conclusion of my rambling:
> 1) you either need 6 vn2vn hosts *or* you need more than one 4.14
> kernel in a network to trigger. One of the two. I need to think
> about this. The fact that the 4.14 systems can't see eachother is
> an indicator.  I can turn off the FCoE on some other system to
> see if the memleak stops.

I've gone all out:

Ok, deleted everything, long story short:
(for now(!))

PC+steam machine with 4.14 (patched) and 4.16 (upstream,
nodebug): no kmemleaks
Every device sees every device.

Now if I add two odroids running 4.14 (not patched):
The odroids will see *every* device running < 4.14 and vv
The odroids will not see any device running >= 4.14 and vv
The PC starts to kmemleak a lot.

I've removed the two 4.14 odroids again from the FCoE lan, and
upgraded them with a patched 4.14 .
I rebooted the PC with steam (4.16 upstream unpatched) and all
other systems in the lan I mentioned.
To be clear: the steam machine has a dedicated nic for FCoE. All
other systems use a vlan.
I will let that talk with eachother for now.

I fear there is a platform specific thing in the vn2vn chatter
that has changed > 4.9 .
4.14 can communicate with lower revisions, but they can't see
eachother, and whatever they chat about, it induces memory leaks
on the PC.
Oh, another difference is that steam and the PC has targets
assigned. I might try that for the 4.14 too.

Anyway, I will do some cooking and let the current systems chat
with eachother and see if we get a kmemleak making my statement
bogus.

Regards,
Ard

-- 
.signature not found

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] scsi: fcoe: memleak fixes
  2018-08-06  9:25 ` [PATCH 0/3] scsi: fcoe: memleak fixes Johannes Thumshirn
  2018-08-06 13:22   ` ard
@ 2018-08-09  8:05   ` Johannes Thumshirn
  2018-08-09  9:52     ` Martin K. Petersen
  1 sibling, 1 reply; 19+ messages in thread
From: Johannes Thumshirn @ 2018-08-09  8:05 UTC (permalink / raw)
  To: Martin K . Petersen; +Cc: Linux Kernel Mailinglist, Linux SCSI Mailinglist, ard

On Mon, Aug 06, 2018 at 11:25:03AM +0200, Johannes Thumshirn wrote:
Martin,

though it looks like these patches aren't fixing all of the errors Ard
has, they fix all I could reproduce up to now and Ard already stated
they're not regressing.

So can we please stage them for 4.19?

Thanks,
	Johannes

-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] scsi: fcoe: memleak fixes
  2018-08-09  8:05   ` Johannes Thumshirn
@ 2018-08-09  9:52     ` Martin K. Petersen
  2018-08-09  9:56       ` Johannes Thumshirn
  0 siblings, 1 reply; 19+ messages in thread
From: Martin K. Petersen @ 2018-08-09  9:52 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Martin K . Petersen, Linux Kernel Mailinglist,
	Linux SCSI Mailinglist, ard


Johannes,

> Martin,
>
> though it looks like these patches aren't fixing all of the errors Ard
> has, they fix all I could reproduce up to now and Ard already stated
> they're not regressing.
>
> So can we please stage them for 4.19?

I merged them last week.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] scsi: fcoe: memleak fixes
  2018-08-09  9:52     ` Martin K. Petersen
@ 2018-08-09  9:56       ` Johannes Thumshirn
  0 siblings, 0 replies; 19+ messages in thread
From: Johannes Thumshirn @ 2018-08-09  9:56 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: Linux Kernel Mailinglist, Linux SCSI Mailinglist, ard

On Thu, Aug 09, 2018 at 05:52:13AM -0400, Martin K. Petersen wrote:
> 
> Johannes,
> 
> > Martin,
> >
> > though it looks like these patches aren't fixing all of the errors Ard
> > has, they fix all I could reproduce up to now and Ard already stated
> > they're not regressing.
> >
> > So can we please stage them for 4.19?
> 
> I merged them last week.

Oh I'm sorry, missed that.

-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] scsi: fcoe: memleak fixes
  2018-08-07 16:04             ` ard
@ 2018-08-09 10:01               ` ard
  0 siblings, 0 replies; 19+ messages in thread
From: ard @ 2018-08-09 10:01 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Martin K . Petersen, Linux Kernel Mailinglist, Linux SCSI Mailinglist

Hi Guys,

On Tue, Aug 07, 2018 at 06:04:52PM +0200, ard wrote:
> PC+steam machine with 4.14 (patched) and 4.16 (upstream,
> nodebug): no kmemleaks
> Every device sees every device.

New day, new conflicting results.
Yay \0/.

As I did not trust the results, I redid the tests, and the same
tests gave some different results.
Before giving the results I've changed my stance on the bug:
The bug is not a regression in memory leak. As far as I can tell
now, the memory leaks were already there.
It's a regression in vn2vn enodes being able to PLOGI.
Since I've seen the steam machine and the PC setup an rport,
there must be some racy thing going on how the accept or reject
the PLOGI.
Now once it rejects, it will never succeed to accept, and the
relogin happens ad infinitum.
In this mode there are about 47 kmemleaks per 10 minutes.
I also notice that the kmemleaks takes a while to be detected or
to die out. So there are state timers involved that hold on to
the memory and after time out do not free it.
And another thing I noticed: When the pc and the steam machine
had a working rport, after a while the steam machine (4.16
unpatched) fc_timedout the rports to all nodes (so all nodes with
kernel < 4.14 too), and all with different timeouts, except the
one it has an fc_transport with.
So it's sole remaining rport was the "designated" target.
Currently I am compiling 4.9 with kmemleak to determine if that
exhibits the same leaks when disconnecting and reconnecting the
FCoE vlan.
This to determine if we have a single regression in just the
login handling or both.
I will add the dmesg's of a working rport, and a failing rport
later.

Regards,
Ard

-- 
.signature not found

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2018-08-09 10:01 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-31 13:46 [PATCH 0/3] scsi: fcoe: memleak fixes Johannes Thumshirn
2018-07-31 13:46 ` [PATCH 1/3] fcoe: fix use-after-free in fcoe_ctlr_els_send Johannes Thumshirn
2018-08-01  6:30   ` Hannes Reinecke
2018-07-31 13:46 ` [PATCH 2/3] scsi: fcoe: drop frames in ELS LOGO error path Johannes Thumshirn
2018-08-01  6:30   ` Hannes Reinecke
2018-07-31 13:46 ` [PATCH 3/3] scsi: fcoe: clear FC_RP_STARTED flags when receiving a LOGO Johannes Thumshirn
2018-08-01  6:31   ` Hannes Reinecke
2018-08-06  9:25 ` [PATCH 0/3] scsi: fcoe: memleak fixes Johannes Thumshirn
2018-08-06 13:22   ` ard
2018-08-06 13:27     ` Johannes Thumshirn
2018-08-06 14:24       ` ard
2018-08-07  6:54         ` Johannes Thumshirn
2018-08-07  9:26           ` ard
2018-08-07  9:57             ` ard
2018-08-07 16:04             ` ard
2018-08-09 10:01               ` ard
2018-08-09  8:05   ` Johannes Thumshirn
2018-08-09  9:52     ` Martin K. Petersen
2018-08-09  9:56       ` Johannes Thumshirn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).