All of lore.kernel.org
* Urgent  Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
@ 2021-08-05 20:53 Martin Zaharinov
  2021-08-06  4:40 ` Greg KH
  0 siblings, 1 reply; 23+ messages in thread
From: Martin Zaharinov @ 2021-08-05 20:53 UTC (permalink / raw)
  To: netdev, gregkh, Eric Dumazet

Hi netdev team,


Please check this error.
I wrote about this problem before: https://www.spinics.net/lists/netdev/msg707513.html

But no solution was found.

The server setup is: bonded port-channel (LACP) > Accel-PPP server > Huawei switch.

The server works fine with 500+ users going down/up.
But at some point the server hits a spike that affects other VLANs on the same server.
And in accel-ppp I see many lines with this error.

Is there a way to find and fix this bug?

I discussed this problem with the Accel-PPP team; they claim it is a kernel bug and that a solution needs to be found with the kernel dev team.


[2021-08-05 13:52:05.294] vlan912: 24b205903d09718e: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:05.298] vlan912: 24b205903d097162: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:05.626] vlan641: 24b205903d09711b: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:11.000] vlan912: 24b205903d097105: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:17.852] vlan912: 24b205903d0971ae: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:21.113] vlan641: 24b205903d09715b: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:27.963] vlan912: 24b205903d09718d: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:30.249] vlan496: 24b205903d097184: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:30.992] vlan420: 24b205903d09718a: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:33.937] vlan640: 24b205903d0971cd: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:40.032] vlan912: 24b205903d097182: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:40.420] vlan912: 24b205903d0971d5: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:42.799] vlan912: 24b205903d09713a: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:42.799] vlan614: 24b205903d0971e5: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:43.102] vlan912: 24b205903d097190: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:43.850] vlan479: 24b205903d097153: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:43.850] vlan479: 24b205903d097141: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:43.852] vlan912: 24b205903d097198: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:43.977] vlan637: 24b205903d097148: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-08-05 13:52:44.528] vlan637: 24b205903d0971c3: ioctl(PPPIOCCONNECT): Transport endpoint is not connected


Martin

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Urgent  Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
  2021-08-05 20:53 Urgent Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected Martin Zaharinov
@ 2021-08-06  4:40 ` Greg KH
  2021-08-06  5:40   ` Martin Zaharinov
  2021-08-08 15:14   ` Martin Zaharinov
  0 siblings, 2 replies; 23+ messages in thread
From: Greg KH @ 2021-08-06  4:40 UTC (permalink / raw)
  To: Martin Zaharinov; +Cc: netdev, Eric Dumazet

On Thu, Aug 05, 2021 at 11:53:50PM +0300, Martin Zaharinov wrote:
> Hi netdev team,
> 
> 
> Please check this error.
> I wrote about this problem before: https://www.spinics.net/lists/netdev/msg707513.html
> 
> But no solution was found.
> 
> The server setup is: bonded port-channel (LACP) > Accel-PPP server > Huawei switch.
> 
> The server works fine with 500+ users going down/up.
> But at some point the server hits a spike that affects other VLANs on the same server.
> And in accel-ppp I see many lines with this error.
> 
> Is there a way to find and fix this bug?
> 
> I discussed this problem with the Accel-PPP team; they claim it is a kernel bug and that a solution needs to be found with the kernel dev team.
> 
> 
> [20 repeated ioctl(PPPIOCCONNECT) log lines snipped]

These are userspace error messages, not kernel messages.

What kernel version are you using?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Urgent  Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
  2021-08-06  4:40 ` Greg KH
@ 2021-08-06  5:40   ` Martin Zaharinov
  2021-08-08 15:14   ` Martin Zaharinov
  1 sibling, 0 replies; 23+ messages in thread
From: Martin Zaharinov @ 2021-08-06  5:40 UTC (permalink / raw)
  To: Greg KH; +Cc: netdev, Eric Dumazet

Hi Greg

I'm running the latest kernel, 5.13.8.

I've also tried older versions from 5.10 to 5.13, and it's the same error.

Martin

> On 6 Aug 2021, at 7:40, Greg KH <gregkh@linuxfoundation.org> wrote:
> 
> On Thu, Aug 05, 2021 at 11:53:50PM +0300, Martin Zaharinov wrote:
>> Hi netdev team,
>> 
>> 
>> Please check this error.
>> I wrote about this problem before: https://www.spinics.net/lists/netdev/msg707513.html
>> 
>> But no solution was found.
>> 
>> The server setup is: bonded port-channel (LACP) > Accel-PPP server > Huawei switch.
>> 
>> The server works fine with 500+ users going down/up.
>> But at some point the server hits a spike that affects other VLANs on the same server.
>> And in accel-ppp I see many lines with this error.
>> 
>> Is there a way to find and fix this bug?
>> 
>> I discussed this problem with the Accel-PPP team; they claim it is a kernel bug and that a solution needs to be found with the kernel dev team.
>> 
>> 
>> [20 repeated ioctl(PPPIOCCONNECT) log lines snipped]
> 
> These are userspace error messages, not kernel messages.
> 
> What kernel version are you using?
> 
> thanks,
> 
> greg k-h


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Urgent  Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
  2021-08-06  4:40 ` Greg KH
  2021-08-06  5:40   ` Martin Zaharinov
@ 2021-08-08 15:14   ` Martin Zaharinov
  2021-08-08 15:23     ` Pali Rohár
  1 sibling, 1 reply; 23+ messages in thread
From: Martin Zaharinov @ 2021-08-08 15:14 UTC (permalink / raw)
  To: Greg KH; +Cc: netdev, Eric Dumazet, pali

Adding Pali Rohár,

in case you have any ideas.

Martin

> On 6 Aug 2021, at 7:40, Greg KH <gregkh@linuxfoundation.org> wrote:
> 
> On Thu, Aug 05, 2021 at 11:53:50PM +0300, Martin Zaharinov wrote:
>> Hi netdev team,
>> 
>> 
>> Please check this error.
>> I wrote about this problem before: https://www.spinics.net/lists/netdev/msg707513.html
>> 
>> But no solution was found.
>> 
>> The server setup is: bonded port-channel (LACP) > Accel-PPP server > Huawei switch.
>> 
>> The server works fine with 500+ users going down/up.
>> But at some point the server hits a spike that affects other VLANs on the same server.
>> And in accel-ppp I see many lines with this error.
>> 
>> Is there a way to find and fix this bug?
>> 
>> I discussed this problem with the Accel-PPP team; they claim it is a kernel bug and that a solution needs to be found with the kernel dev team.
>> 
>> 
>> [20 repeated ioctl(PPPIOCCONNECT) log lines snipped]
> 
> These are userspace error messages, not kernel messages.
> 
> What kernel version are you using?
> 
> thanks,
> 
> greg k-h


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Urgent  Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
  2021-08-08 15:14   ` Martin Zaharinov
@ 2021-08-08 15:23     ` Pali Rohár
  2021-08-08 15:29       ` Martin Zaharinov
  0 siblings, 1 reply; 23+ messages in thread
From: Pali Rohár @ 2021-08-08 15:23 UTC (permalink / raw)
  To: Martin Zaharinov; +Cc: Greg KH, netdev, Eric Dumazet

Hello!

On Sunday 08 August 2021 18:14:09 Martin Zaharinov wrote:
> Adding Pali Rohár,
> 
> in case you have any ideas.
> 
> Martin
> 
> > On 6 Aug 2021, at 7:40, Greg KH <gregkh@linuxfoundation.org> wrote:
> > 
> > On Thu, Aug 05, 2021 at 11:53:50PM +0300, Martin Zaharinov wrote:
> >> Hi netdev team,
> >> 
> >> 
> >> Please check this error.
> >> I wrote about this problem before: https://www.spinics.net/lists/netdev/msg707513.html
> >> 
> >> But no solution was found.
> >> 
> >> The server setup is: bonded port-channel (LACP) > Accel-PPP server > Huawei switch.
> >> 
> >> The server works fine with 500+ users going down/up.
> >> But at some point the server hits a spike that affects other VLANs on the same server.

When did this error start happening? After a kernel upgrade? After a pppd
upgrade? After a system upgrade? Or when more users started
connecting?

> >> And in accel-ppp I see many lines with this error.
> >> 
> >> Is there a way to find and fix this bug?
> >> 
> >> I discussed this problem with the Accel-PPP team; they claim it is a kernel bug and that a solution needs to be found with the kernel dev team.
> >> 
> >> 
> >> [20 repeated ioctl(PPPIOCCONNECT) log lines snipped]
> > 
> > These are userspace error messages, not kernel messages.
> > 
> > What kernel version are you using?

Yes, we need to know what kernel version you are using.

> > thanks,
> > 
> > greg k-h
> 

Another question: what version of the pppd daemon are you using?

Also, are you able to dump the state of the PPP channels and PPP units? We
need to know which tty device, file descriptor, or socket each
particular PPP channel is (or should be) bound to.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Urgent  Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
  2021-08-08 15:23     ` Pali Rohár
@ 2021-08-08 15:29       ` Martin Zaharinov
  2021-08-09 15:15         ` Pali Rohár
  0 siblings, 1 reply; 23+ messages in thread
From: Martin Zaharinov @ 2021-08-08 15:29 UTC (permalink / raw)
  To: Pali Rohár; +Cc: Greg KH, netdev, Eric Dumazet

Hi Pali,

Kernel 5.13.8.


The problem has existed since kernel 5.8; I have tried every major release since: 5.9, 5.10, 5.11, 5.12.

I use the accel-pppd daemon (not pppd).

And yes, it happens after users start connecting.

When the system boots and users connect for the first time, everyone connects without any problem.
During normal operation users disconnect and reconnect (power cuts, fiber cuts, or other network problems), but at the moment of the spike (perhaps a lock or some other problem) ~400-500 users disconnect and other users are affected. The process load goes over 100%, and in the statistics I see many connections finishing and many connections starting.
During this time the log fills with ioctl(PPPIOCCONNECT): Transport endpoint is not connected. After it finishes (unlocks, or whatever it is), the error stops appearing, the system returns to normal, and all the disconnected users reconnect.

Martin

> On 8 Aug 2021, at 18:23, Pali Rohár <pali@kernel.org> wrote:
> 
> Hello!
> 
> On Sunday 08 August 2021 18:14:09 Martin Zaharinov wrote:
>> Adding Pali Rohár,
>> 
>> in case you have any ideas.
>> 
>> Martin
>> 
>>> On 6 Aug 2021, at 7:40, Greg KH <gregkh@linuxfoundation.org> wrote:
>>> 
>>> On Thu, Aug 05, 2021 at 11:53:50PM +0300, Martin Zaharinov wrote:
>>>> Hi netdev team,
>>>> 
>>>> 
>>>> Please check this error.
>>>> I wrote about this problem before: https://www.spinics.net/lists/netdev/msg707513.html
>>>> 
>>>> But no solution was found.
>>>> 
>>>> The server setup is: bonded port-channel (LACP) > Accel-PPP server > Huawei switch.
>>>> 
>>>> The server works fine with 500+ users going down/up.
>>>> But at some point the server hits a spike that affects other VLANs on the same server.
> 
> When did this error start happening? After a kernel upgrade? After a pppd
> upgrade? After a system upgrade? Or when more users started
> connecting?
> 
>>>> And in accel-ppp I see many lines with this error.
>>>> 
>>>> Is there a way to find and fix this bug?
>>>> 
>>>> I discussed this problem with the Accel-PPP team; they claim it is a kernel bug and that a solution needs to be found with the kernel dev team.
>>>> 
>>>> 
>>>> [20 repeated ioctl(PPPIOCCONNECT) log lines snipped]
>>> 
>>> These are userspace error messages, not kernel messages.
>>> 
>>> What kernel version are you using?
> 
> Yes, we need to know what kernel version you are using.
> 
>>> thanks,
>>> 
>>> greg k-h
>> 
> 
> Another question: what version of the pppd daemon are you using?
> 
> Also, are you able to dump the state of the PPP channels and PPP units? We
> need to know which tty device, file descriptor, or socket each
> particular PPP channel is (or should be) bound to.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Urgent  Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
  2021-08-08 15:29       ` Martin Zaharinov
@ 2021-08-09 15:15         ` Pali Rohár
  2021-08-10 18:27           ` Martin Zaharinov
  2021-08-11 11:10           ` Martin Zaharinov
  0 siblings, 2 replies; 23+ messages in thread
From: Pali Rohár @ 2021-08-09 15:15 UTC (permalink / raw)
  To: Martin Zaharinov; +Cc: Greg KH, netdev, Eric Dumazet

On Sunday 08 August 2021 18:29:30 Martin Zaharinov wrote:
> Hi Pali,
> 
> Kernel 5.13.8.
> 
> 
> The problem has existed since kernel 5.8; I have tried every major release since: 5.9, 5.10, 5.11, 5.12.
> 
> I use the accel-pppd daemon (not pppd).

I'm not using accel-pppd, so I cannot help here.

I would suggest running "git bisect" to find the kernel version that
started to be problematic for accel-pppd.

Providing the state of the PPP channels and PPP units could help debug this
issue, but I'm not sure whether accel-pppd has this debug feature. IIRC, only
a process which holds the ppp file descriptors can retrieve and dump this
information.

> And yes, it happens after users start connecting.
> 
> When the system boots and users connect for the first time, everyone connects without any problem.
> During normal operation users disconnect and reconnect (power cuts, fiber cuts, or other network problems), but at the moment of the spike (perhaps a lock or some other problem) ~400-500 users disconnect and other users are affected. The process load goes over 100%, and in the statistics I see many connections finishing and many connections starting.
> During this time the log fills with ioctl(PPPIOCCONNECT): Transport endpoint is not connected. After it finishes (unlocks, or whatever it is), the error stops appearing, the system returns to normal, and all the disconnected users reconnect.
> 
> Martin
> 
> > On 8 Aug 2021, at 18:23, Pali Rohár <pali@kernel.org> wrote:
> > 
> > Hello!
> > 
> > On Sunday 08 August 2021 18:14:09 Martin Zaharinov wrote:
> >> Adding Pali Rohár,
> >> 
> >> in case you have any ideas.
> >> 
> >> Martin
> >> 
> >>> On 6 Aug 2021, at 7:40, Greg KH <gregkh@linuxfoundation.org> wrote:
> >>> 
> >>> On Thu, Aug 05, 2021 at 11:53:50PM +0300, Martin Zaharinov wrote:
> >>>> Hi netdev team,
> >>>> 
> >>>> 
> >>>> Please check this error.
> >>>> I wrote about this problem before: https://www.spinics.net/lists/netdev/msg707513.html
> >>>> 
> >>>> But no solution was found.
> >>>> 
> >>>> The server setup is: bonded port-channel (LACP) > Accel-PPP server > Huawei switch.
> >>>> 
> >>>> The server works fine with 500+ users going down/up.
> >>>> But at some point the server hits a spike that affects other VLANs on the same server.
> > 
> > When did this error start happening? After a kernel upgrade? After a pppd
> > upgrade? After a system upgrade? Or when more users started
> > connecting?
> > 
> >>>> And in accel-ppp I see many lines with this error.
> >>>> 
> >>>> Is there a way to find and fix this bug?
> >>>> 
> >>>> I discussed this problem with the Accel-PPP team; they claim it is a kernel bug and that a solution needs to be found with the kernel dev team.
> >>>> 
> >>>> 
> >>>> [20 repeated ioctl(PPPIOCCONNECT) log lines snipped]
> >>> 
> >>> These are userspace error messages, not kernel messages.
> >>> 
> >>> What kernel version are you using?
> > 
> > Yes, we need to know what kernel version you are using.
> > 
> >>> thanks,
> >>> 
> >>> greg k-h
> >> 
> > 
> > Another question: what version of the pppd daemon are you using?
> > 
> > Also, are you able to dump the state of the PPP channels and PPP units? We
> > need to know which tty device, file descriptor, or socket each
> > particular PPP channel is (or should be) bound to.
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Urgent  Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
  2021-08-09 15:15         ` Pali Rohár
@ 2021-08-10 18:27           ` Martin Zaharinov
  2021-08-11 16:40             ` Guillaume Nault
  2021-08-11 11:10           ` Martin Zaharinov
  1 sibling, 1 reply; 23+ messages in thread
From: Martin Zaharinov @ 2021-08-10 18:27 UTC (permalink / raw)
  To: Pali Rohár; +Cc: Greg KH, netdev, Eric Dumazet, Guillaume Nault

Adding Guillaume Nault.

> On 9 Aug 2021, at 18:15, Pali Rohár <pali@kernel.org> wrote:
> 
> On Sunday 08 August 2021 18:29:30 Martin Zaharinov wrote:
>> Hi Pali,
>> 
>> Kernel 5.13.8.
>> 
>> 
>> The problem has existed since kernel 5.8; I have tried every major release since: 5.9, 5.10, 5.11, 5.12.
>> 
>> I use the accel-pppd daemon (not pppd).
> 
> I'm not using accel-pppd, so I cannot help here.
> 
> I would suggest running "git bisect" to find the kernel version that
> started to be problematic for accel-pppd.
> 
> Providing the state of the PPP channels and PPP units could help debug this
> issue, but I'm not sure whether accel-pppd has this debug feature. IIRC, only
> a process which holds the ppp file descriptors can retrieve and dump this
> information.
> 
>> And yes, it happens after users start connecting.
>> 
>> When the system boots and users connect for the first time, everyone connects without any problem.
>> During normal operation users disconnect and reconnect (power cuts, fiber cuts, or other network problems), but at the moment of the spike (perhaps a lock or some other problem) ~400-500 users disconnect and other users are affected. The process load goes over 100%, and in the statistics I see many connections finishing and many connections starting.
>> During this time the log fills with ioctl(PPPIOCCONNECT): Transport endpoint is not connected. After it finishes (unlocks, or whatever it is), the error stops appearing, the system returns to normal, and all the disconnected users reconnect.
>> 
>> Martin
>> 
>>> On 8 Aug 2021, at 18:23, Pali Rohár <pali@kernel.org> wrote:
>>> 
>>> Hello!
>>> 
>>> On Sunday 08 August 2021 18:14:09 Martin Zaharinov wrote:
>>>> Adding Pali Rohár,
>>>> 
>>>> in case you have any ideas.
>>>> 
>>>> Martin
>>>> 
>>>>> On 6 Aug 2021, at 7:40, Greg KH <gregkh@linuxfoundation.org> wrote:
>>>>> 
>>>>> On Thu, Aug 05, 2021 at 11:53:50PM +0300, Martin Zaharinov wrote:
>>>>>> Hi netdev team,
>>>>>> 
>>>>>> 
>>>>>> Please check this error.
>>>>>> I wrote about this problem before: https://www.spinics.net/lists/netdev/msg707513.html
>>>>>> 
>>>>>> But no solution was found.
>>>>>> 
>>>>>> The server setup is: bonded port-channel (LACP) > Accel-PPP server > Huawei switch.
>>>>>> 
>>>>>> The server works fine with 500+ users going down/up.
>>>>>> But at some point the server hits a spike that affects other VLANs on the same server.
>>> 
>>> When this error started to happen? After kernel upgrade? After pppd
>>> upgrade? Or after system upgrade? Or when more users started to
>>> connecting?
>>> 
>>>>>> And in accel I see many row with this error.
>>>>>> 
>>>>>> Is there options to find and fix this bug.
>>>>>> 
>>>>>> With accel team I discus this problem  and they claim it is kernel bug and need to find solution with Kernel dev team.
>>>>>> 
>>>>>> 
>>>>>> [2021-08-05 13:52:05.294] vlan912: 24b205903d09718e: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:05.298] vlan912: 24b205903d097162: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:05.626] vlan641: 24b205903d09711b: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:11.000] vlan912: 24b205903d097105: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:17.852] vlan912: 24b205903d0971ae: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:21.113] vlan641: 24b205903d09715b: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:27.963] vlan912: 24b205903d09718d: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:30.249] vlan496: 24b205903d097184: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:30.992] vlan420: 24b205903d09718a: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:33.937] vlan640: 24b205903d0971cd: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:40.032] vlan912: 24b205903d097182: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:40.420] vlan912: 24b205903d0971d5: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:42.799] vlan912: 24b205903d09713a: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:42.799] vlan614: 24b205903d0971e5: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:43.102] vlan912: 24b205903d097190: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:43.850] vlan479: 24b205903d097153: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:43.850] vlan479: 24b205903d097141: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:43.852] vlan912: 24b205903d097198: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:43.977] vlan637: 24b205903d097148: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:44.528] vlan637: 24b205903d0971c3: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>> 
>>>>> These are userspace error messages, not kernel messages.
>>>>> 
>>>>> What kernel version are you using?
>>> 
>>> Yes, we need to know, what kernel version are you using.
>>> 
>>>>> thanks,
>>>>> 
>>>>> greg k-h
>>>> 
>>> 
>>> And also another question, what version of pppd daemon are you using?
>>> 
>>> Also, are you able to dump state of ppp channels and ppp units? It is
>>> needed to know to which tty device, file descriptor (or socket
>>> extension) is (or should be) particular ppp channel bounded.
>> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Urgent  Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
  2021-08-09 15:15         ` Pali Rohár
  2021-08-10 18:27           ` Martin Zaharinov
@ 2021-08-11 11:10           ` Martin Zaharinov
  2021-08-11 16:48             ` Guillaume Nault
  1 sibling, 1 reply; 23+ messages in thread
From: Martin Zaharinov @ 2021-08-11 11:10 UTC (permalink / raw)
  To: Pali Rohár, Guillaume Nault; +Cc: Greg KH, netdev, Eric Dumazet

One more thing I noticed.

The problem appears when accel-ppp starts terminating sessions.
The server currently has 2k users; 3 OLTs with 400 users on one of the vlans restarted, and other vlans were affected as well.
The trouble starts when accel-ppp begins destroying the dead sessions from the vlan with the 3 OLTs, and this spills over to all other vlans.
Maybe the kernel destroys old sessions slowly and holds up other users by locking their sessions.
Is there a way to speed up the closing of stopped/dead sessions?

Martin

> On 9 Aug 2021, at 18:15, Pali Rohár <pali@kernel.org> wrote:
> 
> On Sunday 08 August 2021 18:29:30 Martin Zaharinov wrote:
>> Hi Pali
>> 
>> Kernel 5.13.8
>> 
>> 
>> The problem is from kernel 5.8 > I try all major update 5.9, 5.10, 5.11 ,5.12
>> 
>> I use accel-pppd daemon (not pppd) .
> 
> I'm not using accel-pppd, so cannot help here.
> 
> I would suggest to try "git bisect" kernel version which started to be
> problematic for accel-pppd.
> 
> Providing state of ppp channels and ppp units could help to debug this
> issue, but I'm not sure if accel-pppd has this debug feature. IIRC only
> process which has ppp file descriptors can retrieve and dump this
> information.
> 
>> And yes, it started after more users began connecting.
>> 
>> When the system boots and users connect for the first time, everyone connects without any problem.
>> During normal operation users disconnect and reconnect (power cut, fiber cut, or other network problems), but during a spike (maybe a lock or some other problem) ~400-500 users disconnect and other users are affected. The process load goes over 100%, and in the statistics I see many connections finishing and many starting.
>> During this time the log fills with ioctl(PPPIOCCONNECT): Transport endpoint is not connected. After it finishes (unlocks, or whatever it is), the error stops appearing, the system returns to normal, and all disconnected users reconnect.
>> 
>> Martin
>> 
>>> On 8 Aug 2021, at 18:23, Pali Rohár <pali@kernel.org> wrote:
>>> 
>>> Hello!
>>> 
>>> On Sunday 08 August 2021 18:14:09 Martin Zaharinov wrote:
>>>> Add Pali Rohár,
>>>> 
>>>> In case you have any ideas.
>>>> 
>>>> Martin
>>>> 
>>>>> On 6 Aug 2021, at 7:40, Greg KH <gregkh@linuxfoundation.org> wrote:
>>>>> 
>>>>> On Thu, Aug 05, 2021 at 11:53:50PM +0300, Martin Zaharinov wrote:
>>>>>> Hi Net dev team
>>>>>> 
>>>>>> 
>>>>>> Please check this error :
>>>>>> Last time I write for this problem : https://www.spinics.net/lists/netdev/msg707513.html
>>>>>> 
>>>>>> But not find any solution.
>>>>>> 
>>>>>> Config of server is : Bonding port channel (LACP)  > Accel PPP server > Huawei switch.
>>>>>> 
>>>>>> Server is work fine users is down/up 500+ users .
>>>>>> But in one moment server make spike and affect other vlans in same server .
>>> 
>>> When this error started to happen? After kernel upgrade? After pppd
>>> upgrade? Or after system upgrade? Or when more users started to
>>> connecting?
>>> 
>>>>>> And in accel I see many row with this error.
>>>>>> 
>>>>>> Is there options to find and fix this bug.
>>>>>> 
>>>>>> With accel team I discus this problem  and they claim it is kernel bug and need to find solution with Kernel dev team.
>>>>>> 
>>>>>> 
>>>>>> [2021-08-05 13:52:05.294] vlan912: 24b205903d09718e: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:05.298] vlan912: 24b205903d097162: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:05.626] vlan641: 24b205903d09711b: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:11.000] vlan912: 24b205903d097105: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:17.852] vlan912: 24b205903d0971ae: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:21.113] vlan641: 24b205903d09715b: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:27.963] vlan912: 24b205903d09718d: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:30.249] vlan496: 24b205903d097184: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:30.992] vlan420: 24b205903d09718a: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:33.937] vlan640: 24b205903d0971cd: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:40.032] vlan912: 24b205903d097182: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:40.420] vlan912: 24b205903d0971d5: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:42.799] vlan912: 24b205903d09713a: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:42.799] vlan614: 24b205903d0971e5: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:43.102] vlan912: 24b205903d097190: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:43.850] vlan479: 24b205903d097153: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:43.850] vlan479: 24b205903d097141: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:43.852] vlan912: 24b205903d097198: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:43.977] vlan637: 24b205903d097148: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>>> [2021-08-05 13:52:44.528] vlan637: 24b205903d0971c3: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>>>> 
>>>>> These are userspace error messages, not kernel messages.
>>>>> 
>>>>> What kernel version are you using?
>>> 
>>> Yes, we need to know, what kernel version are you using.
>>> 
>>>>> thanks,
>>>>> 
>>>>> greg k-h
>>>> 
>>> 
>>> And also another question, what version of pppd daemon are you using?
>>> 
>>> Also, are you able to dump state of ppp channels and ppp units? It is
>>> needed to know to which tty device, file descriptor (or socket
>>> extension) is (or should be) particular ppp channel bounded.
>> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Urgent  Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
  2021-08-10 18:27           ` Martin Zaharinov
@ 2021-08-11 16:40             ` Guillaume Nault
  0 siblings, 0 replies; 23+ messages in thread
From: Guillaume Nault @ 2021-08-11 16:40 UTC (permalink / raw)
  To: Martin Zaharinov; +Cc: Pali Rohár, Greg KH, netdev, Eric Dumazet

On Tue, Aug 10, 2021 at 09:27:14PM +0300, Martin Zaharinov wrote:
> Add Guillaume Nault
> 
> > On 9 Aug 2021, at 18:15, Pali Rohár <pali@kernel.org> wrote:
> > 
> > On Sunday 08 August 2021 18:29:30 Martin Zaharinov wrote:
> >>>>>> [2021-08-05 13:52:05.294] vlan912: 24b205903d09718e: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:05.298] vlan912: 24b205903d097162: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:05.626] vlan641: 24b205903d09711b: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:11.000] vlan912: 24b205903d097105: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:17.852] vlan912: 24b205903d0971ae: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:21.113] vlan641: 24b205903d09715b: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:27.963] vlan912: 24b205903d09718d: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:30.249] vlan496: 24b205903d097184: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:30.992] vlan420: 24b205903d09718a: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:33.937] vlan640: 24b205903d0971cd: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:40.032] vlan912: 24b205903d097182: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:40.420] vlan912: 24b205903d0971d5: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:42.799] vlan912: 24b205903d09713a: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:42.799] vlan614: 24b205903d0971e5: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:43.102] vlan912: 24b205903d097190: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:43.850] vlan479: 24b205903d097153: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:43.850] vlan479: 24b205903d097141: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:43.852] vlan912: 24b205903d097198: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:43.977] vlan637: 24b205903d097148: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>>>>> [2021-08-05 13:52:44.528] vlan637: 24b205903d0971c3: ioctl(PPPIOCCONNECT): Transport endpoint is not connected

The PPPIOCCONNECT ioctl returns -ENOTCONN if the ppp channel has been
unregistered.

From a user space point of view, this means that accel-ppp establishes
PPPoE sessions, starts negotiating PPP connection parameters on top of
them (LCP and authentication) and finally the PPPoE sessions get
disconnected before accel-ppp connects them to ppp units (units are
roughly the "pppX" network devices).

Unregistration of PPPoE channels can happen for the following reasons:

  * Changing some parameters of the network interface used by the
    PPPoE connection: MAC address, MTU, bringing the device down.

  * Reception of a PADT (PPPoE disconnection message sent from the peer).

  * Closing the PPPoE socket.

  * Re-connecting a PPPoE socket with a different session ID (this
    unregisters the previous channel and creates a new one, so that
    shouldn't be the problem you're facing here).

Given that this seems to affect all PPPoE connections, I guess
something happened to the underlying network interface (1st bullet
point).
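For reference, the error string accel-ppp logs is just libc's rendering of that ENOTCONN. A minimal Python sketch of the userspace side follows; the connect_channel() helper and its error handling are illustrative only, not accel-ppp's actual code. Only the PPPIOCCONNECT value (from include/uapi/linux/ppp-ioctl.h) and the errno mapping are taken from the kernel/libc.

```python
import errno
import fcntl
import os
import struct

# The userspace message in the logs is just strerror(ENOTCONN): the
# kernel returns -ENOTCONN from PPPIOCCONNECT once the PPPoE channel
# has been unregistered.
print(os.strerror(errno.ENOTCONN))  # on glibc: "Transport endpoint is not connected"

# PPPIOCCONNECT is _IOW('t', 58, int) in include/uapi/linux/ppp-ioctl.h,
# i.e. 0x4004743A on the usual ABIs.
PPPIOCCONNECT = 0x4004743A

def connect_channel(ppp_fd: int, unit_index: int) -> None:
    """Attach an already-registered ppp channel fd to ppp unit `unit_index`.

    Illustrative only: this mirrors the step at which accel-ppp's log
    line is produced, it is not accel-ppp's actual code.
    """
    try:
        fcntl.ioctl(ppp_fd, PPPIOCCONNECT, struct.pack("i", unit_index))
    except OSError as e:
        if e.errno == errno.ENOTCONN:
            # The PPPoE session died (PADT, interface change, socket
            # close) between session setup and this call.
            raise RuntimeError("ppp channel was unregistered before connect") from e
        raise
```

So the message does not indicate a failure inside the ioctl itself; it reports that the channel was already gone by the time the daemon tried to attach it.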


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Urgent  Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
  2021-08-11 11:10           ` Martin Zaharinov
@ 2021-08-11 16:48             ` Guillaume Nault
  2021-09-07  6:16               ` Martin Zaharinov
  0 siblings, 1 reply; 23+ messages in thread
From: Guillaume Nault @ 2021-08-11 16:48 UTC (permalink / raw)
  To: Martin Zaharinov; +Cc: Pali Rohár, Greg KH, netdev, Eric Dumazet

On Wed, Aug 11, 2021 at 02:10:32PM +0300, Martin Zaharinov wrote:
> One more thing I noticed.
> 
> The problem appears when accel-ppp starts terminating sessions.
> The server currently has 2k users; 3 OLTs with 400 users on one of the vlans restarted, and other vlans were affected as well.
> The trouble starts when accel-ppp begins destroying the dead sessions from the vlan with the 3 OLTs, and this spills over to all other vlans.
> Maybe the kernel destroys old sessions slowly and holds up other users by locking their sessions.
> Is there a way to speed up the closing of stopped/dead sessions?

What are the CPU stats when that happens? Is it user space or kernel
space that keeps it busy?

One easy way to check is to run "mpstat 1" for a few seconds when the
problem occurs.
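The point of that check can be sketched in a few lines of Python; the column indices assume sysstat's default `mpstat 1` layout and the sample row is hypothetical:

```python
# Column layout assumed from sysstat's default `mpstat 1` report:
# time, CPU, %usr, %nice, %sys, %iowait, %irq, %soft, %steal, ...
def kernel_vs_user(line: str) -> dict:
    f = line.split()
    usr = float(f[2])
    kernel = float(f[4]) + float(f[6]) + float(f[7])  # %sys + %irq + %soft
    return {"user": usr, "kernel": kernel}

# Hypothetical sample row; a kernel-heavy profile looks like this.
row = "11:12:19     all    0.26    0.00    9.62    0.00    0.00    3.91    0.00    0.00    0.00   86.21"
print(kernel_vs_user(row))
```

If %sys and %soft dominate %usr while the problem is happening, the time is being spent in the kernel, not in accel-pppd itself.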


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Urgent  Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
  2021-08-11 16:48             ` Guillaume Nault
@ 2021-09-07  6:16               ` Martin Zaharinov
  2021-09-07  6:42                 ` Martin Zaharinov
  0 siblings, 1 reply; 23+ messages in thread
From: Martin Zaharinov @ 2021-09-07  6:16 UTC (permalink / raw)
  To: Guillaume Nault; +Cc: Pali Rohár, Greg KH, netdev, Eric Dumazet

Hi,
Sorry for the delay, but it's not easy to catch the moment.


Here is the output of "mpstat 1":

Linux 5.14.1 (demobng) 	09/07/21 	_x86_64_	(12 CPU)

11:12:16     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
11:12:17     all    0.17    0.00    6.66    0.00    0.00    4.13    0.00    0.00    0.00   89.05
11:12:18     all    0.25    0.00    8.36    0.00    0.00    4.88    0.00    0.00    0.00   86.51
11:12:19     all    0.26    0.00    9.62    0.00    0.00    3.91    0.00    0.00    0.00   86.21
11:12:20     all    0.85    0.00    6.00    0.00    0.00    4.31    0.00    0.00    0.00   88.84
11:12:21     all    0.08    0.00    4.45    0.00    0.00    4.79    0.00    0.00    0.00   90.67
11:12:22     all    0.17    0.00    9.50    0.00    0.00    4.58    0.00    0.00    0.00   85.75
11:12:23     all    0.00    0.00    6.92    0.00    0.00    2.48    0.00    0.00    0.00   90.61
11:12:24     all    0.17    0.00    5.45    0.00    0.00    4.27    0.00    0.00    0.00   90.11
11:12:25     all    0.25    0.00    5.38    0.00    0.00    4.79    0.00    0.00    0.00   89.58
11:12:26     all    0.60    0.00    1.45    0.00    0.00    2.65    0.00    0.00    0.00   95.30
11:12:27     all    0.42    0.00    6.91    0.00    0.00    4.47    0.00    0.00    0.00   88.20
11:12:28     all    0.00    0.00    6.75    0.00    0.00    4.18    0.00    0.00    0.00   89.07
11:12:29     all    0.17    0.00    3.52    0.00    0.00    5.11    0.00    0.00    0.00   91.20
11:12:30     all    1.45    0.00   10.14    0.00    0.00    3.49    0.00    0.00    0.00   84.92
11:12:31     all    0.09    0.00    5.11    0.00    0.00    4.77    0.00    0.00    0.00   90.03
11:12:32     all    0.25    0.00    3.11    0.00    0.00    4.46    0.00    0.00    0.00   92.17
Average:     all    0.32    0.00    6.21    0.00    0.00    4.21    0.00    0.00    0.00   89.26


I'm also attaching a screenshot from perf top (the screenshot was sent in the previous mail).

And in lsmod I see:

pppoe                  20480  8198
pppox                  16384  1 pppoe
ppp_generic            45056  16364 pppox,pppoe
slhc                   16384  1 ppp_generic

The pppoe sessions are being removed too slowly.
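Those use counts are suggestive on their own: pppoe at 8198 and ppp_generic at 16364 references for roughly 2k active users hints that old channels/units are still holding module references. A hedged Python sketch for sampling those counts over time; the /proc/modules field layout is standard, and the sample text mirrors the lsmod output above:

```python
# /proc/modules fields: name size refcount deps state offset.
def module_refcounts(text: str) -> dict:
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] in ("pppoe", "pppox", "ppp_generic"):
            counts[fields[0]] = int(fields[2])
    return counts

# Sample mirroring the lsmod output above; in production, read
# open("/proc/modules").read() in a loop and watch whether the counts
# actually fall as sessions are torn down.
sample = """\
pppoe 20480 8198 - Live 0x0000000000000000
pppox 16384 1 pppoe, Live 0x0000000000000000
ppp_generic 45056 16364 pppox,pppoe, Live 0x0000000000000000
"""
print(module_refcounts(sample))
```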

And from the log:

[2021-09-07 11:01:11.129] vlan3020: ebdd1c5d8b5900f6: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-09-07 11:01:53.621] vlan643: ebdd1c5d8b59014e: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-09-07 11:02:00.359] vlan1616: ebdd1c5d8b590195: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-09-07 11:02:05.859] vlan3020: ebdd1c5d8b5900d8: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-09-07 11:02:08.258] vlan3005: ebdd1c5d8b590190: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-09-07 11:02:13.820] vlan643: ebdd1c5d8b590152: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-09-07 11:02:15.839] vlan727: ebdd1c5d8b590144: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
[2021-09-07 11:02:20.139] vlan1693: ebdd1c5d8b59019f: ioctl(PPPIOCCONNECT): Transport endpoint is not connected

> On 11 Aug 2021, at 19:48, Guillaume Nault <gnault@redhat.com> wrote:
> 
> On Wed, Aug 11, 2021 at 02:10:32PM +0300, Martin Zaharinov wrote:
>> One more thing I noticed.
>> 
>> The problem appears when accel-ppp starts terminating sessions.
>> The server currently has 2k users; 3 OLTs with 400 users on one of the vlans restarted, and other vlans were affected as well.
>> The trouble starts when accel-ppp begins destroying the dead sessions from the vlan with the 3 OLTs, and this spills over to all other vlans.
>> Maybe the kernel destroys old sessions slowly and holds up other users by locking their sessions.
>> Is there a way to speed up the closing of stopped/dead sessions?
> 
> What are the CPU stats when that happen? Is it users space or kernel
> space that keeps it busy?
> 
> One easy way to check is to run "mpstat 1" for a few seconds when the
> problem occurs.
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Urgent  Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
  2021-09-07  6:16               ` Martin Zaharinov
@ 2021-09-07  6:42                 ` Martin Zaharinov
  2021-09-11  6:26                   ` Martin Zaharinov
  0 siblings, 1 reply; 23+ messages in thread
From: Martin Zaharinov @ 2021-09-07  6:42 UTC (permalink / raw)
  To: Guillaume Nault; +Cc: Pali Rohár, Greg KH, netdev, Eric Dumazet

Perf top from text


PerfTop:   28391 irqs/sec  kernel:98.0%  exact: 100.0% lost: 0/0 drop: 0/0 [4000Hz cycles],  (all, 12 CPUs)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    17.01%  [nf_conntrack]           [k] nf_ct_iterate_cleanup
     9.73%  [kernel]                 [k] mutex_spin_on_owner
     9.07%  [pppoe]                  [k] pppoe_rcv
     2.77%  [nf_nat]                 [k] device_cmp
     1.66%  [kernel]                 [k] osq_lock
     1.65%  [kernel]                 [k] _raw_spin_lock
     1.61%  [kernel]                 [k] __local_bh_enable_ip
     1.35%  [nf_nat]                 [k] inet_cmp
     1.30%  [kernel]                 [k] __netif_receive_skb_core.constprop.0
     1.16%  [kernel]                 [k] menu_select
     0.99%  [kernel]                 [k] cpuidle_enter_state
     0.96%  [ixgbe]                  [k] ixgbe_clean_rx_irq
     0.86%  [kernel]                 [k] __dev_queue_xmit
     0.70%  [kernel]                 [k] __cond_resched
     0.69%  [sch_cake]               [k] cake_dequeue
     0.67%  [nf_tables]              [k] nft_do_chain
     0.63%  [kernel]                 [k] rcu_all_qs
     0.61%  [kernel]                 [k] fib_table_lookup
     0.57%  [kernel]                 [k] __schedule
     0.57%  [kernel]                 [k] skb_release_data
     0.54%  [kernel]                 [k] sched_clock
     0.54%  [kernel]                 [k] __copy_skb_header
     0.53%  [kernel]                 [k] dev_queue_xmit_nit
     0.53%  [kernel]                 [k] _raw_spin_lock_irqsave
     0.50%  [kernel]                 [k] kmem_cache_free
     0.48%  libfrr.so.0.0.0          [.] 0x00000000000ce970
     0.47%  [ixgbe]                  [k] ixgbe_clean_tx_irq
     0.45%  [kernel]                 [k] timerqueue_add
     0.45%  [kernel]                 [k] lapic_next_deadline
     0.45%  [kernel]                 [k] csum_partial_copy_generic
     0.44%  [nf_flow_table]          [k] nf_flow_offload_ip_hook
     0.44%  [kernel]                 [k] kmem_cache_alloc
     0.44%  [nf_conntrack]           [k] nf_conntrack_lock
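The profile itself points at a likely culprit: nf_ct_iterate_cleanup (with the nf_nat device_cmp/inet_cmp callbacks also visible above) walks the entire conntrack table for each ppp device that goes down, which would also explain the mutex_spin_on_owner time if those walks serialize on a lock. A toy Python model, emphatically not kernel code, of why a burst of teardowns scales badly:

```python
# Toy model (not kernel code): if every dying pppX device triggers a
# cleanup that walks the whole conntrack table, as nf_ct_iterate_cleanup()
# appears to in this profile, a burst of teardowns costs roughly
# devices_down x conntrack_entries work.
def teardown_cost(devices_down: int, conntrack_entries: int) -> int:
    visits = 0
    for _ in range(devices_down):      # one cleanup pass per dead device
        visits += conntrack_entries    # full table walk per pass
    return visits

# 500 sessions dying against a 1M-entry conntrack table.
print(teardown_cost(500, 1_000_000))  # -> 500000000
```

Under this model the cost is quadratic in a mass-disconnect event, which matches the observation that one vlan's OLT restart stalls every other vlan on the box.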

> On 7 Sep 2021, at 9:16, Martin Zaharinov <micron10@gmail.com> wrote:
> 
> Hi,
> Sorry for the delay, but it's not easy to catch the moment.
> 
> 
> Here is the output of "mpstat 1":
> 
> Linux 5.14.1 (demobng) 	09/07/21 	_x86_64_	(12 CPU)
> 
> 11:12:16     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> 11:12:17     all    0.17    0.00    6.66    0.00    0.00    4.13    0.00    0.00    0.00   89.05
> 11:12:18     all    0.25    0.00    8.36    0.00    0.00    4.88    0.00    0.00    0.00   86.51
> 11:12:19     all    0.26    0.00    9.62    0.00    0.00    3.91    0.00    0.00    0.00   86.21
> 11:12:20     all    0.85    0.00    6.00    0.00    0.00    4.31    0.00    0.00    0.00   88.84
> 11:12:21     all    0.08    0.00    4.45    0.00    0.00    4.79    0.00    0.00    0.00   90.67
> 11:12:22     all    0.17    0.00    9.50    0.00    0.00    4.58    0.00    0.00    0.00   85.75
> 11:12:23     all    0.00    0.00    6.92    0.00    0.00    2.48    0.00    0.00    0.00   90.61
> 11:12:24     all    0.17    0.00    5.45    0.00    0.00    4.27    0.00    0.00    0.00   90.11
> 11:12:25     all    0.25    0.00    5.38    0.00    0.00    4.79    0.00    0.00    0.00   89.58
> 11:12:26     all    0.60    0.00    1.45    0.00    0.00    2.65    0.00    0.00    0.00   95.30
> 11:12:27     all    0.42    0.00    6.91    0.00    0.00    4.47    0.00    0.00    0.00   88.20
> 11:12:28     all    0.00    0.00    6.75    0.00    0.00    4.18    0.00    0.00    0.00   89.07
> 11:12:29     all    0.17    0.00    3.52    0.00    0.00    5.11    0.00    0.00    0.00   91.20
> 11:12:30     all    1.45    0.00   10.14    0.00    0.00    3.49    0.00    0.00    0.00   84.92
> 11:12:31     all    0.09    0.00    5.11    0.00    0.00    4.77    0.00    0.00    0.00   90.03
> 11:12:32     all    0.25    0.00    3.11    0.00    0.00    4.46    0.00    0.00    0.00   92.17
> Average:     all    0.32    0.00    6.21    0.00    0.00    4.21    0.00    0.00    0.00   89.26
> 
> 
> I'm also attaching a screenshot from perf top (the screenshot was sent in the previous mail).
> 
> And in lsmod I see:
> 
> pppoe                  20480  8198
> pppox                  16384  1 pppoe
> ppp_generic            45056  16364 pppox,pppoe
> slhc                   16384  1 ppp_generic
> 
> The pppoe sessions are being removed too slowly.
> 
> And from the log:
> 
> [2021-09-07 11:01:11.129] vlan3020: ebdd1c5d8b5900f6: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> [2021-09-07 11:01:53.621] vlan643: ebdd1c5d8b59014e: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> [2021-09-07 11:02:00.359] vlan1616: ebdd1c5d8b590195: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> [2021-09-07 11:02:05.859] vlan3020: ebdd1c5d8b5900d8: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> [2021-09-07 11:02:08.258] vlan3005: ebdd1c5d8b590190: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> [2021-09-07 11:02:13.820] vlan643: ebdd1c5d8b590152: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> [2021-09-07 11:02:15.839] vlan727: ebdd1c5d8b590144: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> [2021-09-07 11:02:20.139] vlan1693: ebdd1c5d8b59019f: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> 
>> On 11 Aug 2021, at 19:48, Guillaume Nault <gnault@redhat.com> wrote:
>> 
>> On Wed, Aug 11, 2021 at 02:10:32PM +0300, Martin Zaharinov wrote:
>>> One more thing I noticed.
>>> 
>>> The problem appears when accel-ppp starts terminating sessions.
>>> The server currently has 2k users; 3 OLTs with 400 users on one of the vlans restarted, and other vlans were affected as well.
>>> The trouble starts when accel-ppp begins destroying the dead sessions from the vlan with the 3 OLTs, and this spills over to all other vlans.
>>> Maybe the kernel destroys old sessions slowly and holds up other users by locking their sessions.
>>> Is there a way to speed up the closing of stopped/dead sessions?
>> 
>> What are the CPU stats when that happen? Is it users space or kernel
>> space that keeps it busy?
>> 
>> One easy way to check is to run "mpstat 1" for a few seconds when the
>> problem occurs.
>> 
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Urgent  Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
  2021-09-07  6:42                 ` Martin Zaharinov
@ 2021-09-11  6:26                   ` Martin Zaharinov
  2021-09-14  6:16                     ` Martin Zaharinov
  0 siblings, 1 reply; 23+ messages in thread
From: Martin Zaharinov @ 2021-09-11  6:26 UTC (permalink / raw)
  To: Guillaume Nault; +Cc: Pali Rohár, Greg KH, netdev, Eric Dumazet

Hi Guillaume,

The main problem is service overload when many ppp (customer) sessions finish at once: in the last two days the disconnect bursts have grown from 40-50 to 100-200 users. When it happens, even typing "ip a" waits 10-20 seconds before it starts listing the interfaces.
But how can I find where the problem is, some locking or something else?
And is there a way to make the kernel remove ppp interfaces faster, to reduce this load?


Martin

> On 7 Sep 2021, at 9:42, Martin Zaharinov <micron10@gmail.com> wrote:
> 
> Perf top from text
> 
> 
> PerfTop:   28391 irqs/sec  kernel:98.0%  exact: 100.0% lost: 0/0 drop: 0/0 [4000Hz cycles],  (all, 12 CPUs)
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 
>    17.01%  [nf_conntrack]           [k] nf_ct_iterate_cleanup
>     9.73%  [kernel]                 [k] mutex_spin_on_owner
>     9.07%  [pppoe]                  [k] pppoe_rcv
>     2.77%  [nf_nat]                 [k] device_cmp
>     1.66%  [kernel]                 [k] osq_lock
>     1.65%  [kernel]                 [k] _raw_spin_lock
>     1.61%  [kernel]                 [k] __local_bh_enable_ip
>     1.35%  [nf_nat]                 [k] inet_cmp
>     1.30%  [kernel]                 [k] __netif_receive_skb_core.constprop.0
>     1.16%  [kernel]                 [k] menu_select
>     0.99%  [kernel]                 [k] cpuidle_enter_state
>     0.96%  [ixgbe]                  [k] ixgbe_clean_rx_irq
>     0.86%  [kernel]                 [k] __dev_queue_xmit
>     0.70%  [kernel]                 [k] __cond_resched
>     0.69%  [sch_cake]               [k] cake_dequeue
>     0.67%  [nf_tables]              [k] nft_do_chain
>     0.63%  [kernel]                 [k] rcu_all_qs
>     0.61%  [kernel]                 [k] fib_table_lookup
>     0.57%  [kernel]                 [k] __schedule
>     0.57%  [kernel]                 [k] skb_release_data
>     0.54%  [kernel]                 [k] sched_clock
>     0.54%  [kernel]                 [k] __copy_skb_header
>     0.53%  [kernel]                 [k] dev_queue_xmit_nit
>     0.53%  [kernel]                 [k] _raw_spin_lock_irqsave
>     0.50%  [kernel]                 [k] kmem_cache_free
>     0.48%  libfrr.so.0.0.0          [.] 0x00000000000ce970
>     0.47%  [ixgbe]                  [k] ixgbe_clean_tx_irq
>     0.45%  [kernel]                 [k] timerqueue_add
>     0.45%  [kernel]                 [k] lapic_next_deadline
>     0.45%  [kernel]                 [k] csum_partial_copy_generic
>     0.44%  [nf_flow_table]          [k] nf_flow_offload_ip_hook
>     0.44%  [kernel]                 [k] kmem_cache_alloc
>     0.44%  [nf_conntrack]           [k] nf_conntrack_lock
> 
>> On 7 Sep 2021, at 9:16, Martin Zaharinov <micron10@gmail.com> wrote:
>> 
>> Hi,
>> Sorry for the delay, but it's not easy to catch the moment.
>> 
>> 
>> Here is the output of "mpstat 1":
>> 
>> Linux 5.14.1 (demobng) 	09/07/21 	_x86_64_	(12 CPU)
>> 
>> 11:12:16     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
>> 11:12:17     all    0.17    0.00    6.66    0.00    0.00    4.13    0.00    0.00    0.00   89.05
>> 11:12:18     all    0.25    0.00    8.36    0.00    0.00    4.88    0.00    0.00    0.00   86.51
>> 11:12:19     all    0.26    0.00    9.62    0.00    0.00    3.91    0.00    0.00    0.00   86.21
>> 11:12:20     all    0.85    0.00    6.00    0.00    0.00    4.31    0.00    0.00    0.00   88.84
>> 11:12:21     all    0.08    0.00    4.45    0.00    0.00    4.79    0.00    0.00    0.00   90.67
>> 11:12:22     all    0.17    0.00    9.50    0.00    0.00    4.58    0.00    0.00    0.00   85.75
>> 11:12:23     all    0.00    0.00    6.92    0.00    0.00    2.48    0.00    0.00    0.00   90.61
>> 11:12:24     all    0.17    0.00    5.45    0.00    0.00    4.27    0.00    0.00    0.00   90.11
>> 11:12:25     all    0.25    0.00    5.38    0.00    0.00    4.79    0.00    0.00    0.00   89.58
>> 11:12:26     all    0.60    0.00    1.45    0.00    0.00    2.65    0.00    0.00    0.00   95.30
>> 11:12:27     all    0.42    0.00    6.91    0.00    0.00    4.47    0.00    0.00    0.00   88.20
>> 11:12:28     all    0.00    0.00    6.75    0.00    0.00    4.18    0.00    0.00    0.00   89.07
>> 11:12:29     all    0.17    0.00    3.52    0.00    0.00    5.11    0.00    0.00    0.00   91.20
>> 11:12:30     all    1.45    0.00   10.14    0.00    0.00    3.49    0.00    0.00    0.00   84.92
>> 11:12:31     all    0.09    0.00    5.11    0.00    0.00    4.77    0.00    0.00    0.00   90.03
>> 11:12:32     all    0.25    0.00    3.11    0.00    0.00    4.46    0.00    0.00    0.00   92.17
>> Average:     all    0.32    0.00    6.21    0.00    0.00    4.21    0.00    0.00    0.00   89.26
>> 
>> 
>> I attache and one screenshot from perf top (Screenshot is send on preview mail)
>> 
>> And I see in lsmod 
>> 
>> pppoe                  20480  8198
>> pppox                  16384  1 pppoe
>> ppp_generic            45056  16364 pppox,pppoe
>> slhc                   16384  1 ppp_generic
>> 
>> To slow remove pppoe session .
>> 
>> And from log : 
>> 
>> [2021-09-07 11:01:11.129] vlan3020: ebdd1c5d8b5900f6: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-09-07 11:01:53.621] vlan643: ebdd1c5d8b59014e: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-09-07 11:02:00.359] vlan1616: ebdd1c5d8b590195: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-09-07 11:02:05.859] vlan3020: ebdd1c5d8b5900d8: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-09-07 11:02:08.258] vlan3005: ebdd1c5d8b590190: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-09-07 11:02:13.820] vlan643: ebdd1c5d8b590152: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-09-07 11:02:15.839] vlan727: ebdd1c5d8b590144: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-09-07 11:02:20.139] vlan1693: ebdd1c5d8b59019f: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> 
>>> On 11 Aug 2021, at 19:48, Guillaume Nault <gnault@redhat.com> wrote:
>>> 
>>> On Wed, Aug 11, 2021 at 02:10:32PM +0300, Martin Zaharinov wrote:
>>>> And one more that see.
>>>> 
>>>> Problem is come when accel start finishing sessions,
>>>> Now in server have 2k users and restart on one of vlans 3 Olt with 400 users and affect other vlans ,
>>>> And problem is start when start destroying dead sessions from vlan with 3 Olt and this affect all other vlans.
>>>> May be kernel destroy old session slow and entrained other users by locking other sessions.
>>>> is there a way to speed up the closing of stopped/dead sessions.
>>> 
>>> What are the CPU stats when that happen? Is it users space or kernel
>>> space that keeps it busy?
>>> 
>>> One easy way to check is to run "mpstat 1" for a few seconds when the
>>> problem occurs.
>>> 
>> 
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Urgent  Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
  2021-09-11  6:26                   ` Martin Zaharinov
@ 2021-09-14  6:16                     ` Martin Zaharinov
  2021-09-14  8:02                       ` Guillaume Nault
  0 siblings, 1 reply; 23+ messages in thread
From: Martin Zaharinov @ 2021-09-14  6:16 UTC (permalink / raw)
  To: Guillaume Nault; +Cc: Pali Rohár, Greg KH, netdev, Eric Dumazet

Hi Guillaume,

Here are the stats:

Linux 5.14.2 (testb)   09/14/21        _x86_64_        (12 CPU)

11:33:44     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
11:33:45     all    1.75    0.00   18.85    0.00    0.00    5.00    0.00    0.00    0.00   74.40
11:33:46     all    1.74    0.00   17.88    0.00    0.00    4.72    0.00    0.00    0.00   75.66
11:33:47     all    2.23    0.00   17.62    0.00    0.00    5.05    0.00    0.00    0.00   75.10
11:33:48     all    1.82    0.00   13.64    0.00    0.00    5.70    0.00    0.00    0.00   78.84
11:33:49     all    1.50    0.00   13.46    0.00    0.00    5.15    0.00    0.00    0.00   79.90
11:33:50     all    3.06    0.00   13.96    0.00    0.00    4.79    0.00    0.00    0.00   78.20
11:33:51     all    1.40    0.00   16.53    0.00    0.00    5.21    0.00    0.00    0.00   76.86
11:33:52     all    4.43    0.00   19.44    0.00    0.00    6.56    0.00    0.00    0.00   69.57
11:33:53     all    1.51    0.00   16.40    0.00    0.00    4.77    0.00    0.00    0.00   77.32
11:33:54     all    1.51    0.00   16.55    0.00    0.00    4.71    0.00    0.00    0.00   77.23
11:33:55     all    1.00    0.00   13.21    0.00    0.00    5.90    0.00    0.00    0.00   79.90
Average:     all    2.00    0.00   16.14    0.00    0.00    5.23    0.00    0.00    0.00   76.63


  PerfTop:   28046 irqs/sec  kernel:96.3%  exact: 100.0% lost: 0/0 drop: 0/0 [4000Hz cycles],  (all, 12 CPUs)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    23.37%  [nf_conntrack]           [k] nf_ct_iterate_cleanup
    17.76%  [kernel]                 [k] mutex_spin_on_owner
     9.47%  [pppoe]                  [k] pppoe_rcv
     7.71%  [kernel]                 [k] osq_lock
     2.77%  [nf_nat]                 [k] inet_cmp
     2.59%  [nf_nat]                 [k] device_cmp
     2.55%  [kernel]                 [k] __local_bh_enable_ip
     2.04%  [kernel]                 [k] _raw_spin_lock
     1.23%  [kernel]                 [k] __cond_resched
     1.16%  [kernel]                 [k] rcu_all_qs
     1.13%  libfrr.so.0.0.0          [.] 0x00000000000ce970
     0.79%  [nf_conntrack]           [k] nf_conntrack_lock
     0.75%  libfrr.so.0.0.0          [.] 0x00000000000ce94e
     0.53%  [kernel]                 [k] __netif_receive_skb_core.constprop.0
     0.46%  [kernel]                 [k] fib_table_lookup
     0.46%  [ip_tables]              [k] ipt_do_table
     0.45%  [ixgbe]                  [k] ixgbe_clean_rx_irq
     0.37%  [kernel]                 [k] __dev_queue_xmit
     0.34%  [nf_conntrack]           [k] __nf_conntrack_find_get.isra.0
     0.33%  [ixgbe]                  [k] ixgbe_clean_tx_irq
     0.30%  [kernel]                 [k] menu_select
     0.25%  [kernel]                 [k] vlan_do_receive
     0.21%  [kernel]                 [k] ip_finish_output2
     0.21%  [ixgbe]                  [k] ixgbe_poll
     0.20%  [kernel]                 [k] _raw_spin_lock_irqsave
     0.19%  [kernel]                 [k] get_rps_cpu
     0.19%  libc.so.6                [.] 0x0000000000186afa
     0.19%  [kernel]                 [k] queued_read_lock_slowpath
     0.19%  [kernel]                 [k] do_poll.constprop.0
     0.19%  [kernel]                 [k] cpuidle_enter_state
     0.18%  [kernel]                 [k] dev_hard_start_xmit
     0.18%  [kernel]                 [k] ___slab_alloc.constprop.0
     0.17%  zebra                    [.] 0x00000000000b9271
     0.16%  [kernel]                 [k] csum_partial_copy_generic
     0.16%  zebra                    [.] 0x00000000000b91f1
     0.16%  [kernel]                 [k] page_frag_free
     0.16%  [kernel]                 [k] kmem_cache_alloc
     0.15%  [kernel]                 [k] __skb_flow_dissect
     0.15%  [kernel]                 [k] sched_clock
     0.15%  libc.so.6                [.] 0x00000000000965a2
     0.15%  [kernel]                 [k] kmem_cache_free_bulk.part.0
     0.15%  [pppoe]                  [k] pppoe_flush_dev
     0.15%  [ixgbe]                  [k] ixgbe_tx_map
     0.14%  [kernel]                 [k] _raw_spin_lock_bh
     0.14%  [kernel]                 [k] fib_table_flush
     0.14%  [kernel]                 [k] native_irq_return_iret
     0.14%  [kernel]                 [k] __dev_xmit_skb
     0.13%  [kernel]                 [k] nf_hook_slow
     0.13%  [kernel]                 [k] fib_lookup_good_nhc
     0.12%  [kernel]                 [k] __fget_files
     0.12%  [kernel]                 [k] process_backlog
     0.12%  [xt_dtvqos]              [k] 0x00000000000008d1
     0.12%  [kernel]                 [k] __list_del_entry_valid
     0.12%  [kernel]                 [k] skb_release_data
     0.12%  [kernel]                 [k] ip_route_input_slow
     0.11%  [kernel]                 [k] netif_skb_features
     0.11%  [kernel]                 [k] sock_poll
     0.11%  [kernel]                 [k] __schedule
     0.11%  [kernel]                 [k] __softirqentry_text_start


Also, while the problem is happening, running `ip a` to list the interfaces takes 15-20 seconds. I can finally reproduce it, but users get angry when their internet is down.

We need to find out why the system gets so overloaded when PPP interfaces are deconfigured.


Best regards,
Martin




> On 11 Sep 2021, at 9:26, Martin Zaharinov <micron10@gmail.com> wrote:
> 
> Hi Guillaume
> 
> Main problem is overload of service because have many finishing ppp (customer) last two day down from 40-50 to 100-200 users and make problem when is happen if try to type : ip a wait 10-20 sec to start list interface .
> But how to find where is a problem any locking or other.
> And is there options to make fast remove ppp interface from kernel to reduce this load.
> 
> 
> Martin
> 
>> On 7 Sep 2021, at 9:42, Martin Zaharinov <micron10@gmail.com> wrote:
>> 
>> Perf top from text
>> 
>> 
>> PerfTop:   28391 irqs/sec  kernel:98.0%  exact: 100.0% lost: 0/0 drop: 0/0 [4000Hz cycles],  (all, 12 CPUs)
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>> 
>>   17.01%  [nf_conntrack]           [k] nf_ct_iterate_cleanup
>>    9.73%  [kernel]                 [k] mutex_spin_on_owner
>>    9.07%  [pppoe]                  [k] pppoe_rcv
>>    2.77%  [nf_nat]                 [k] device_cmp
>>    1.66%  [kernel]                 [k] osq_lock
>>    1.65%  [kernel]                 [k] _raw_spin_lock
>>    1.61%  [kernel]                 [k] __local_bh_enable_ip
>>    1.35%  [nf_nat]                 [k] inet_cmp
>>    1.30%  [kernel]                 [k] __netif_receive_skb_core.constprop.0
>>    1.16%  [kernel]                 [k] menu_select
>>    0.99%  [kernel]                 [k] cpuidle_enter_state
>>    0.96%  [ixgbe]                  [k] ixgbe_clean_rx_irq
>>    0.86%  [kernel]                 [k] __dev_queue_xmit
>>    0.70%  [kernel]                 [k] __cond_resched
>>    0.69%  [sch_cake]               [k] cake_dequeue
>>    0.67%  [nf_tables]              [k] nft_do_chain
>>    0.63%  [kernel]                 [k] rcu_all_qs
>>    0.61%  [kernel]                 [k] fib_table_lookup
>>    0.57%  [kernel]                 [k] __schedule
>>    0.57%  [kernel]                 [k] skb_release_data
>>    0.54%  [kernel]                 [k] sched_clock
>>    0.54%  [kernel]                 [k] __copy_skb_header
>>    0.53%  [kernel]                 [k] dev_queue_xmit_nit
>>    0.53%  [kernel]                 [k] _raw_spin_lock_irqsave
>>    0.50%  [kernel]                 [k] kmem_cache_free
>>    0.48%  libfrr.so.0.0.0          [.] 0x00000000000ce970
>>    0.47%  [ixgbe]                  [k] ixgbe_clean_tx_irq
>>    0.45%  [kernel]                 [k] timerqueue_add
>>    0.45%  [kernel]                 [k] lapic_next_deadline
>>    0.45%  [kernel]                 [k] csum_partial_copy_generic
>>    0.44%  [nf_flow_table]          [k] nf_flow_offload_ip_hook
>>    0.44%  [kernel]                 [k] kmem_cache_alloc
>>    0.44%  [nf_conntrack]           [k] nf_conntrack_lock
>> 
>>> On 7 Sep 2021, at 9:16, Martin Zaharinov <micron10@gmail.com> wrote:
>>> 
>>> Hi 
>>> Sorry for delay but not easy to catch moment .
>>> 
>>> 
>>> See this is mpstatl 1 :
>>> 
>>> Linux 5.14.1 (demobng) 	09/07/21 	_x86_64_	(12 CPU)
>>> 
>>> 11:12:16     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
>>> 11:12:17     all    0.17    0.00    6.66    0.00    0.00    4.13    0.00    0.00    0.00   89.05
>>> 11:12:18     all    0.25    0.00    8.36    0.00    0.00    4.88    0.00    0.00    0.00   86.51
>>> 11:12:19     all    0.26    0.00    9.62    0.00    0.00    3.91    0.00    0.00    0.00   86.21
>>> 11:12:20     all    0.85    0.00    6.00    0.00    0.00    4.31    0.00    0.00    0.00   88.84
>>> 11:12:21     all    0.08    0.00    4.45    0.00    0.00    4.79    0.00    0.00    0.00   90.67
>>> 11:12:22     all    0.17    0.00    9.50    0.00    0.00    4.58    0.00    0.00    0.00   85.75
>>> 11:12:23     all    0.00    0.00    6.92    0.00    0.00    2.48    0.00    0.00    0.00   90.61
>>> 11:12:24     all    0.17    0.00    5.45    0.00    0.00    4.27    0.00    0.00    0.00   90.11
>>> 11:12:25     all    0.25    0.00    5.38    0.00    0.00    4.79    0.00    0.00    0.00   89.58
>>> 11:12:26     all    0.60    0.00    1.45    0.00    0.00    2.65    0.00    0.00    0.00   95.30
>>> 11:12:27     all    0.42    0.00    6.91    0.00    0.00    4.47    0.00    0.00    0.00   88.20
>>> 11:12:28     all    0.00    0.00    6.75    0.00    0.00    4.18    0.00    0.00    0.00   89.07
>>> 11:12:29     all    0.17    0.00    3.52    0.00    0.00    5.11    0.00    0.00    0.00   91.20
>>> 11:12:30     all    1.45    0.00   10.14    0.00    0.00    3.49    0.00    0.00    0.00   84.92
>>> 11:12:31     all    0.09    0.00    5.11    0.00    0.00    4.77    0.00    0.00    0.00   90.03
>>> 11:12:32     all    0.25    0.00    3.11    0.00    0.00    4.46    0.00    0.00    0.00   92.17
>>> Average:     all    0.32    0.00    6.21    0.00    0.00    4.21    0.00    0.00    0.00   89.26
>>> 
>>> 
>>> I attache and one screenshot from perf top (Screenshot is send on preview mail)
>>> 
>>> And I see in lsmod 
>>> 
>>> pppoe                  20480  8198
>>> pppox                  16384  1 pppoe
>>> ppp_generic            45056  16364 pppox,pppoe
>>> slhc                   16384  1 ppp_generic
>>> 
>>> To slow remove pppoe session .
>>> 
>>> And from log : 
>>> 
>>> [2021-09-07 11:01:11.129] vlan3020: ebdd1c5d8b5900f6: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>> [2021-09-07 11:01:53.621] vlan643: ebdd1c5d8b59014e: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>> [2021-09-07 11:02:00.359] vlan1616: ebdd1c5d8b590195: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>> [2021-09-07 11:02:05.859] vlan3020: ebdd1c5d8b5900d8: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>> [2021-09-07 11:02:08.258] vlan3005: ebdd1c5d8b590190: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>> [2021-09-07 11:02:13.820] vlan643: ebdd1c5d8b590152: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>> [2021-09-07 11:02:15.839] vlan727: ebdd1c5d8b590144: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>> [2021-09-07 11:02:20.139] vlan1693: ebdd1c5d8b59019f: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>> 
>>>> On 11 Aug 2021, at 19:48, Guillaume Nault <gnault@redhat.com> wrote:
>>>> 
>>>> On Wed, Aug 11, 2021 at 02:10:32PM +0300, Martin Zaharinov wrote:
>>>>> And one more that see.
>>>>> 
>>>>> Problem is come when accel start finishing sessions,
>>>>> Now in server have 2k users and restart on one of vlans 3 Olt with 400 users and affect other vlans ,
>>>>> And problem is start when start destroying dead sessions from vlan with 3 Olt and this affect all other vlans.
>>>>> May be kernel destroy old session slow and entrained other users by locking other sessions.
>>>>> is there a way to speed up the closing of stopped/dead sessions.
>>>> 
>>>> What are the CPU stats when that happen? Is it users space or kernel
>>>> space that keeps it busy?
>>>> 
>>>> One easy way to check is to run "mpstat 1" for a few seconds when the
>>>> problem occurs.
>>>> 
>>> 
>> 
> 



* Re: Urgent  Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
  2021-09-14  6:16                     ` Martin Zaharinov
@ 2021-09-14  8:02                       ` Guillaume Nault
  2021-09-14  9:50                         ` Florian Westphal
  0 siblings, 1 reply; 23+ messages in thread
From: Guillaume Nault @ 2021-09-14  8:02 UTC (permalink / raw)
  To: Martin Zaharinov; +Cc: Pali Rohár, Greg KH, netdev, Eric Dumazet

On Tue, Sep 14, 2021 at 09:16:55AM +0300, Martin Zaharinov wrote:
> Hi Nault
> 
> See this stats :
> 
> Linux 5.14.2 (testb)   09/14/21        _x86_64_        (12 CPU)
> 
> 11:33:44     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> 11:33:45     all    1.75    0.00   18.85    0.00    0.00    5.00    0.00    0.00    0.00   74.40
> 11:33:46     all    1.74    0.00   17.88    0.00    0.00    4.72    0.00    0.00    0.00   75.66
> 11:33:47     all    2.23    0.00   17.62    0.00    0.00    5.05    0.00    0.00    0.00   75.10
> 11:33:48     all    1.82    0.00   13.64    0.00    0.00    5.70    0.00    0.00    0.00   78.84
> 11:33:49     all    1.50    0.00   13.46    0.00    0.00    5.15    0.00    0.00    0.00   79.90
> 11:33:50     all    3.06    0.00   13.96    0.00    0.00    4.79    0.00    0.00    0.00   78.20
> 11:33:51     all    1.40    0.00   16.53    0.00    0.00    5.21    0.00    0.00    0.00   76.86
> 11:33:52     all    4.43    0.00   19.44    0.00    0.00    6.56    0.00    0.00    0.00   69.57
> 11:33:53     all    1.51    0.00   16.40    0.00    0.00    4.77    0.00    0.00    0.00   77.32
> 11:33:54     all    1.51    0.00   16.55    0.00    0.00    4.71    0.00    0.00    0.00   77.23
> 11:33:55     all    1.00    0.00   13.21    0.00    0.00    5.90    0.00    0.00    0.00   79.90
> Average:     all    2.00    0.00   16.14    0.00    0.00    5.23    0.00    0.00    0.00   76.63
> 
> 
>   PerfTop:   28046 irqs/sec  kernel:96.3%  exact: 100.0% lost: 0/0 drop: 0/0 [4000Hz cycles],  (all, 12 CPUs)
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 
>     23.37%  [nf_conntrack]           [k] nf_ct_iterate_cleanup
>     17.76%  [kernel]                 [k] mutex_spin_on_owner
>      9.47%  [pppoe]                  [k] pppoe_rcv
>      7.71%  [kernel]                 [k] osq_lock
>      2.77%  [nf_nat]                 [k] inet_cmp
>      2.59%  [nf_nat]                 [k] device_cmp
>      2.55%  [kernel]                 [k] __local_bh_enable_ip
>      2.04%  [kernel]                 [k] _raw_spin_lock
>      1.23%  [kernel]                 [k] __cond_resched
>      1.16%  [kernel]                 [k] rcu_all_qs
>      1.13%  libfrr.so.0.0.0          [.] 0x00000000000ce970
>      0.79%  [nf_conntrack]           [k] nf_conntrack_lock
>      0.75%  libfrr.so.0.0.0          [.] 0x00000000000ce94e
>      0.53%  [kernel]                 [k] __netif_receive_skb_core.constprop.0
>      0.46%  [kernel]                 [k] fib_table_lookup
>      0.46%  [ip_tables]              [k] ipt_do_table
>      0.45%  [ixgbe]                  [k] ixgbe_clean_rx_irq
>      0.37%  [kernel]                 [k] __dev_queue_xmit
>      0.34%  [nf_conntrack]           [k] __nf_conntrack_find_get.isra.0
>      0.33%  [ixgbe]                  [k] ixgbe_clean_tx_irq
>      0.30%  [kernel]                 [k] menu_select
>      0.25%  [kernel]                 [k] vlan_do_receive
>      0.21%  [kernel]                 [k] ip_finish_output2
>      0.21%  [ixgbe]                  [k] ixgbe_poll
>      0.20%  [kernel]                 [k] _raw_spin_lock_irqsave
>      0.19%  [kernel]                 [k] get_rps_cpu
>      0.19%  libc.so.6                [.] 0x0000000000186afa
>      0.19%  [kernel]                 [k] queued_read_lock_slowpath
>      0.19%  [kernel]                 [k] do_poll.constprop.0
>      0.19%  [kernel]                 [k] cpuidle_enter_state
>      0.18%  [kernel]                 [k] dev_hard_start_xmit
>      0.18%  [kernel]                 [k] ___slab_alloc.constprop.0
>      0.17%  zebra                    [.] 0x00000000000b9271
>      0.16%  [kernel]                 [k] csum_partial_copy_generic
>      0.16%  zebra                    [.] 0x00000000000b91f1
>      0.16%  [kernel]                 [k] page_frag_free
>      0.16%  [kernel]                 [k] kmem_cache_alloc
>      0.15%  [kernel]                 [k] __skb_flow_dissect
>      0.15%  [kernel]                 [k] sched_clock
>      0.15%  libc.so.6                [.] 0x00000000000965a2
>      0.15%  [kernel]                 [k] kmem_cache_free_bulk.part.0
>      0.15%  [pppoe]                  [k] pppoe_flush_dev
>      0.15%  [ixgbe]                  [k] ixgbe_tx_map
>      0.14%  [kernel]                 [k] _raw_spin_lock_bh
>      0.14%  [kernel]                 [k] fib_table_flush
>      0.14%  [kernel]                 [k] native_irq_return_iret
>      0.14%  [kernel]                 [k] __dev_xmit_skb
>      0.13%  [kernel]                 [k] nf_hook_slow
>      0.13%  [kernel]                 [k] fib_lookup_good_nhc
>      0.12%  [kernel]                 [k] __fget_files
>      0.12%  [kernel]                 [k] process_backlog
>      0.12%  [xt_dtvqos]              [k] 0x00000000000008d1
>      0.12%  [kernel]                 [k] __list_del_entry_valid
>      0.12%  [kernel]                 [k] skb_release_data
>      0.12%  [kernel]                 [k] ip_route_input_slow
>      0.11%  [kernel]                 [k] netif_skb_features
>      0.11%  [kernel]                 [k] sock_poll
>      0.11%  [kernel]                 [k] __schedule
>      0.11%  [kernel]                 [k] __softirqentry_text_start
> 
> 
> Also, while the problem is happening, running `ip a` to list the
> interfaces takes 15-20 seconds. I can finally reproduce it, but users
> get angry when their internet is down.

Probably some contention on the rtnl lock.

> We need to find out why the system gets so overloaded when PPP interfaces are deconfigured.

Does it help if you disable conntrack?

> 
> Best regards,
> Martin
> 
> 
> 
> 
> > On 11 Sep 2021, at 9:26, Martin Zaharinov <micron10@gmail.com> wrote:
> > 
> > Hi Guillaume
> > 
> > Main problem is overload of service because have many finishing ppp (customer) last two day down from 40-50 to 100-200 users and make problem when is happen if try to type : ip a wait 10-20 sec to start list interface .
> > But how to find where is a problem any locking or other.
> > And is there options to make fast remove ppp interface from kernel to reduce this load.
> > 
> > 
> > Martin
> > 
> >> On 7 Sep 2021, at 9:42, Martin Zaharinov <micron10@gmail.com> wrote:
> >> 
> >> Perf top from text
> >> 
> >> 
> >> PerfTop:   28391 irqs/sec  kernel:98.0%  exact: 100.0% lost: 0/0 drop: 0/0 [4000Hz cycles],  (all, 12 CPUs)
> >> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> >> 
> >>   17.01%  [nf_conntrack]           [k] nf_ct_iterate_cleanup
> >>    9.73%  [kernel]                 [k] mutex_spin_on_owner
> >>    9.07%  [pppoe]                  [k] pppoe_rcv
> >>    2.77%  [nf_nat]                 [k] device_cmp
> >>    1.66%  [kernel]                 [k] osq_lock
> >>    1.65%  [kernel]                 [k] _raw_spin_lock
> >>    1.61%  [kernel]                 [k] __local_bh_enable_ip
> >>    1.35%  [nf_nat]                 [k] inet_cmp
> >>    1.30%  [kernel]                 [k] __netif_receive_skb_core.constprop.0
> >>    1.16%  [kernel]                 [k] menu_select
> >>    0.99%  [kernel]                 [k] cpuidle_enter_state
> >>    0.96%  [ixgbe]                  [k] ixgbe_clean_rx_irq
> >>    0.86%  [kernel]                 [k] __dev_queue_xmit
> >>    0.70%  [kernel]                 [k] __cond_resched
> >>    0.69%  [sch_cake]               [k] cake_dequeue
> >>    0.67%  [nf_tables]              [k] nft_do_chain
> >>    0.63%  [kernel]                 [k] rcu_all_qs
> >>    0.61%  [kernel]                 [k] fib_table_lookup
> >>    0.57%  [kernel]                 [k] __schedule
> >>    0.57%  [kernel]                 [k] skb_release_data
> >>    0.54%  [kernel]                 [k] sched_clock
> >>    0.54%  [kernel]                 [k] __copy_skb_header
> >>    0.53%  [kernel]                 [k] dev_queue_xmit_nit
> >>    0.53%  [kernel]                 [k] _raw_spin_lock_irqsave
> >>    0.50%  [kernel]                 [k] kmem_cache_free
> >>    0.48%  libfrr.so.0.0.0          [.] 0x00000000000ce970
> >>    0.47%  [ixgbe]                  [k] ixgbe_clean_tx_irq
> >>    0.45%  [kernel]                 [k] timerqueue_add
> >>    0.45%  [kernel]                 [k] lapic_next_deadline
> >>    0.45%  [kernel]                 [k] csum_partial_copy_generic
> >>    0.44%  [nf_flow_table]          [k] nf_flow_offload_ip_hook
> >>    0.44%  [kernel]                 [k] kmem_cache_alloc
> >>    0.44%  [nf_conntrack]           [k] nf_conntrack_lock
> >> 
> >>> On 7 Sep 2021, at 9:16, Martin Zaharinov <micron10@gmail.com> wrote:
> >>> 
> >>> Hi 
> >>> Sorry for delay but not easy to catch moment .
> >>> 
> >>> 
> >>> See this is mpstatl 1 :
> >>> 
> >>> Linux 5.14.1 (demobng) 	09/07/21 	_x86_64_	(12 CPU)
> >>> 
> >>> 11:12:16     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> >>> 11:12:17     all    0.17    0.00    6.66    0.00    0.00    4.13    0.00    0.00    0.00   89.05
> >>> 11:12:18     all    0.25    0.00    8.36    0.00    0.00    4.88    0.00    0.00    0.00   86.51
> >>> 11:12:19     all    0.26    0.00    9.62    0.00    0.00    3.91    0.00    0.00    0.00   86.21
> >>> 11:12:20     all    0.85    0.00    6.00    0.00    0.00    4.31    0.00    0.00    0.00   88.84
> >>> 11:12:21     all    0.08    0.00    4.45    0.00    0.00    4.79    0.00    0.00    0.00   90.67
> >>> 11:12:22     all    0.17    0.00    9.50    0.00    0.00    4.58    0.00    0.00    0.00   85.75
> >>> 11:12:23     all    0.00    0.00    6.92    0.00    0.00    2.48    0.00    0.00    0.00   90.61
> >>> 11:12:24     all    0.17    0.00    5.45    0.00    0.00    4.27    0.00    0.00    0.00   90.11
> >>> 11:12:25     all    0.25    0.00    5.38    0.00    0.00    4.79    0.00    0.00    0.00   89.58
> >>> 11:12:26     all    0.60    0.00    1.45    0.00    0.00    2.65    0.00    0.00    0.00   95.30
> >>> 11:12:27     all    0.42    0.00    6.91    0.00    0.00    4.47    0.00    0.00    0.00   88.20
> >>> 11:12:28     all    0.00    0.00    6.75    0.00    0.00    4.18    0.00    0.00    0.00   89.07
> >>> 11:12:29     all    0.17    0.00    3.52    0.00    0.00    5.11    0.00    0.00    0.00   91.20
> >>> 11:12:30     all    1.45    0.00   10.14    0.00    0.00    3.49    0.00    0.00    0.00   84.92
> >>> 11:12:31     all    0.09    0.00    5.11    0.00    0.00    4.77    0.00    0.00    0.00   90.03
> >>> 11:12:32     all    0.25    0.00    3.11    0.00    0.00    4.46    0.00    0.00    0.00   92.17
> >>> Average:     all    0.32    0.00    6.21    0.00    0.00    4.21    0.00    0.00    0.00   89.26
> >>> 
> >>> 
> >>> I attache and one screenshot from perf top (Screenshot is send on preview mail)
> >>> 
> >>> And I see in lsmod 
> >>> 
> >>> pppoe                  20480  8198
> >>> pppox                  16384  1 pppoe
> >>> ppp_generic            45056  16364 pppox,pppoe
> >>> slhc                   16384  1 ppp_generic
> >>> 
> >>> To slow remove pppoe session .
> >>> 
> >>> And from log : 
> >>> 
> >>> [2021-09-07 11:01:11.129] vlan3020: ebdd1c5d8b5900f6: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>> [2021-09-07 11:01:53.621] vlan643: ebdd1c5d8b59014e: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>> [2021-09-07 11:02:00.359] vlan1616: ebdd1c5d8b590195: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>> [2021-09-07 11:02:05.859] vlan3020: ebdd1c5d8b5900d8: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>> [2021-09-07 11:02:08.258] vlan3005: ebdd1c5d8b590190: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>> [2021-09-07 11:02:13.820] vlan643: ebdd1c5d8b590152: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>> [2021-09-07 11:02:15.839] vlan727: ebdd1c5d8b590144: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>> [2021-09-07 11:02:20.139] vlan1693: ebdd1c5d8b59019f: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
> >>> 
> >>>> On 11 Aug 2021, at 19:48, Guillaume Nault <gnault@redhat.com> wrote:
> >>>> 
> >>>> On Wed, Aug 11, 2021 at 02:10:32PM +0300, Martin Zaharinov wrote:
> >>>>> And one more that see.
> >>>>> 
> >>>>> Problem is come when accel start finishing sessions,
> >>>>> Now in server have 2k users and restart on one of vlans 3 Olt with 400 users and affect other vlans ,
> >>>>> And problem is start when start destroying dead sessions from vlan with 3 Olt and this affect all other vlans.
> >>>>> May be kernel destroy old session slow and entrained other users by locking other sessions.
> >>>>> is there a way to speed up the closing of stopped/dead sessions.
> >>>> 
> >>>> What are the CPU stats when that happen? Is it users space or kernel
> >>>> space that keeps it busy?
> >>>> 
> >>>> One easy way to check is to run "mpstat 1" for a few seconds when the
> >>>> problem occurs.
> >>>> 
> >>> 
> >> 
> > 
> 



* Re: Urgent  Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
  2021-09-14  8:02                       ` Guillaume Nault
@ 2021-09-14  9:50                         ` Florian Westphal
  2021-09-14 10:01                           ` Martin Zaharinov
  2021-09-14 10:53                           ` Martin Zaharinov
  0 siblings, 2 replies; 23+ messages in thread
From: Florian Westphal @ 2021-09-14  9:50 UTC (permalink / raw)
  To: Guillaume Nault
  Cc: Martin Zaharinov, Pali Rohár, Greg KH, netdev, Eric Dumazet

Guillaume Nault <gnault@redhat.com> wrote:
> > Also, while the problem is happening, running `ip a` to list the
> > interfaces takes 15-20 seconds. I can finally reproduce it, but users
> > get angry when their internet is down.
> 
> Probably some contention on the rtnl lock.

Yes, I'll create a patch.


* Re: Urgent  Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
  2021-09-14  9:50                         ` Florian Westphal
@ 2021-09-14 10:01                           ` Martin Zaharinov
  2021-09-14 11:00                             ` Florian Westphal
  2021-09-14 10:53                           ` Martin Zaharinov
  1 sibling, 1 reply; 23+ messages in thread
From: Martin Zaharinov @ 2021-09-14 10:01 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Guillaume Nault, Pali Rohár, Greg KH, netdev, Eric Dumazet

Hi Guillaume and Florian,

Guillaume:

No, I haven't tried that; we need conntrack to log user traffic.

Florian:

If you make a patch, please send it over so I can test it.


Martin

> On 14 Sep 2021, at 12:50, Florian Westphal <fw@strlen.de> wrote:
> 
> Guillaume Nault <gnault@redhat.com> wrote:
>>> Also, while the problem is happening, running `ip a` to list the
>>> interfaces takes 15-20 seconds. I can finally reproduce it, but users
>>> get angry when their internet is down.
>> 
>> Probably some contention on the rtnl lock.
> 
> Yes, I'll create a patch.



* Re: Urgent  Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
  2021-09-14  9:50                         ` Florian Westphal
  2021-09-14 10:01                           ` Martin Zaharinov
@ 2021-09-14 10:53                           ` Martin Zaharinov
  1 sibling, 0 replies; 23+ messages in thread
From: Martin Zaharinov @ 2021-09-14 10:53 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Guillaume Nault, Pali Rohár, Greg KH, netdev, Eric Dumazet

Hi Florian,

One more thing: I tried removing the nf_nat and xt_MASQUERADE modules. While the problem is happening, the removal takes 50-80 seconds and overloads the system.

Here is the perf output from that moment:

 PerfTop:    1738 irqs/sec  kernel:85.0%  exact: 100.0% lost: 0/0 drop: 0/0 [4000Hz cycles],  (all, 12 CPUs)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    40.63%  [nf_conntrack]    [k] nf_ct_iterate_cleanup
    21.23%  [kernel]          [k] __local_bh_enable_ip
    10.93%  [kernel]          [k] __cond_resched
     9.20%  [kernel]          [k] _raw_spin_lock
     8.91%  [kernel]          [k] rcu_all_qs
     5.83%  [nf_conntrack]    [k] nf_conntrack_lock
     0.10%  [kernel]          [k] mutex_spin_on_owner
     0.08%  telegraf          [.] 0x0000000000021bf0
     0.06%  [kernel]          [k] osq_lock
     0.06%  [kernel]          [k] kallsyms_expand_symbol.constprop.0
     0.05%  [kernel]          [k] format_decode
     0.04%  [kernel]          [k] rtnl_fill_ifinfo.constprop.0.isra.0
     0.04%  perf              [.] 0x00000000000bc7b3
     0.04%  [kernel]          [k] memcpy_erms
     0.03%  [kernel]          [k] string
     0.03%  [kernel]          [k] menu_select
     0.03%  [kernel]          [k] nla_put
     0.03%  [kernel]          [k] vsnprintf



Martin

> On 14 Sep 2021, at 12:50, Florian Westphal <fw@strlen.de> wrote:
> 
> Guillaume Nault <gnault@redhat.com> wrote:
>>> And on time of problem when try to write : ip a 
>>> to list interface wait 15-20 sec i finaly have options to simulate but users is angry when down internet.
>> 
>> Probably some contention on the rtnl lock.
> 
> Yes, I'll create a patch.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Urgent  Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
  2021-09-14 10:01                           ` Martin Zaharinov
@ 2021-09-14 11:00                             ` Florian Westphal
  2021-09-15 14:25                               ` Martin Zaharinov
  2021-09-16 20:00                               ` Martin Zaharinov
  0 siblings, 2 replies; 23+ messages in thread
From: Florian Westphal @ 2021-09-14 11:00 UTC (permalink / raw)
  To: Martin Zaharinov; +Cc: Florian Westphal, Guillaume Nault, netdev

[-- Attachment #1: Type: text/plain, Size: 238 bytes --]

Martin Zaharinov <micron10@gmail.com> wrote:

[ Trimming CC list ]

> Florian: 
> 
> If you make patch send to test please.

Attached.  No idea if it helps, but 'ip' should stay responsive
even when masquerade processes netdevice events.

[-- Attachment #2: defer_masq_work.diff --]
[-- Type: text/x-diff, Size: 6674 bytes --]

diff --git a/net/netfilter/nf_nat_masquerade.c b/net/netfilter/nf_nat_masquerade.c
index 8e8a65d46345..50c6d6992ed6 100644
--- a/net/netfilter/nf_nat_masquerade.c
+++ b/net/netfilter/nf_nat_masquerade.c
@@ -9,8 +9,19 @@
 
 #include <net/netfilter/nf_nat_masquerade.h>
 
+struct masq_dev_work {
+	struct work_struct work;
+	struct net *net;
+	union nf_inet_addr addr;
+	int ifindex;
+	int (*iter)(struct nf_conn *i, void *data);
+};
+
+#define MAX_MASQ_WORKER_COUNT	16
+
 static DEFINE_MUTEX(masq_mutex);
 static unsigned int masq_refcnt __read_mostly;
+static atomic_t masq_worker_count __read_mostly;
 
 unsigned int
 nf_nat_masquerade_ipv4(struct sk_buff *skb, unsigned int hooknum,
@@ -63,13 +74,68 @@ nf_nat_masquerade_ipv4(struct sk_buff *skb, unsigned int hooknum,
 }
 EXPORT_SYMBOL_GPL(nf_nat_masquerade_ipv4);
 
-static int device_cmp(struct nf_conn *i, void *ifindex)
+static void iterate_cleanup_work(struct work_struct *work)
+{
+	struct masq_dev_work *w;
+
+	w = container_of(work, struct masq_dev_work, work);
+
+	nf_ct_iterate_cleanup_net(w->net, w->iter, (void *)w, 0, 0);
+
+	put_net(w->net);
+	kfree(w);
+	atomic_dec(&masq_worker_count);
+	module_put(THIS_MODULE);
+}
+
+/* Iterate conntrack table in the background and remove conntrack entries
+ * that use the device/address being removed.
+ *
+ * In case too many work items have been queued already or memory allocation
+ * fails iteration is skipped, conntrack entries will time out eventually.
+ */
+static void nf_nat_masq_schedule(struct net *net, union nf_inet_addr *addr,
+				 int ifindex,
+				 int (*iter)(struct nf_conn *i, void *data),
+				 gfp_t gfp_flags)
+{
+	struct masq_dev_work *w;
+
+	net = maybe_get_net(net);
+	if (!net)
+		return;
+
+	if (!try_module_get(THIS_MODULE))
+		goto err_module;
+
+	w = kzalloc(sizeof(*w), gfp_flags);
+	if (w) {
+		/* We can overshoot MAX_MASQ_WORKER_COUNT, no big deal */
+		atomic_inc(&masq_worker_count);
+
+		INIT_WORK(&w->work, iterate_cleanup_work);
+		w->ifindex = ifindex;
+		w->net = net;
+		w->iter = iter;
+		if (addr)
+			w->addr = *addr;
+		schedule_work(&w->work);
+		return;
+	}
+
+	module_put(THIS_MODULE);
+ err_module:
+	put_net(net);
+}
+
+static int device_cmp(struct nf_conn *i, void *arg)
 {
 	const struct nf_conn_nat *nat = nfct_nat(i);
+	const struct masq_dev_work *w = arg;
 
 	if (!nat)
 		return 0;
-	return nat->masq_index == (int)(long)ifindex;
+	return nat->masq_index == w->ifindex;
 }
 
 static int masq_device_event(struct notifier_block *this,
@@ -85,8 +151,8 @@ static int masq_device_event(struct notifier_block *this,
 		 * and forget them.
 		 */
 
-		nf_ct_iterate_cleanup_net(net, device_cmp,
-					  (void *)(long)dev->ifindex, 0, 0);
+		nf_nat_masq_schedule(net, NULL, dev->ifindex,
+				     device_cmp, GFP_KERNEL);
 	}
 
 	return NOTIFY_DONE;
@@ -94,35 +160,45 @@ static int masq_device_event(struct notifier_block *this,
 
 static int inet_cmp(struct nf_conn *ct, void *ptr)
 {
-	struct in_ifaddr *ifa = (struct in_ifaddr *)ptr;
-	struct net_device *dev = ifa->ifa_dev->dev;
 	struct nf_conntrack_tuple *tuple;
+	struct masq_dev_work *w = ptr;
 
-	if (!device_cmp(ct, (void *)(long)dev->ifindex))
+	if (!device_cmp(ct, ptr))
 		return 0;
 
 	tuple = &ct->tuplehash[IP_CT_DIR_REPLY].tuple;
 
-	return ifa->ifa_address == tuple->dst.u3.ip;
+	return nf_inet_addr_cmp(&w->addr, &tuple->dst.u3);
 }
 
 static int masq_inet_event(struct notifier_block *this,
 			   unsigned long event,
 			   void *ptr)
 {
-	struct in_device *idev = ((struct in_ifaddr *)ptr)->ifa_dev;
-	struct net *net = dev_net(idev->dev);
+	const struct in_ifaddr *ifa = ptr;
+	const struct in_device *idev;
+	const struct net_device *dev;
+	union nf_inet_addr addr;
+
+	if (event != NETDEV_DOWN)
+		return NOTIFY_DONE;
 
 	/* The masq_dev_notifier will catch the case of the device going
 	 * down.  So if the inetdev is dead and being destroyed we have
 	 * no work to do.  Otherwise this is an individual address removal
 	 * and we have to perform the flush.
 	 */
+	idev = ifa->ifa_dev;
 	if (idev->dead)
 		return NOTIFY_DONE;
 
-	if (event == NETDEV_DOWN)
-		nf_ct_iterate_cleanup_net(net, inet_cmp, ptr, 0, 0);
+	memset(&addr, 0, sizeof(addr));
+
+	addr.ip = ifa->ifa_address;
+
+	dev = idev->dev;
+	nf_nat_masq_schedule(dev_net(idev->dev), &addr, dev->ifindex,
+			     inet_cmp, GFP_KERNEL);
 
 	return NOTIFY_DONE;
 }
@@ -136,8 +212,6 @@ static struct notifier_block masq_inet_notifier = {
 };
 
 #if IS_ENABLED(CONFIG_IPV6)
-static atomic_t v6_worker_count __read_mostly;
-
 static int
 nat_ipv6_dev_get_saddr(struct net *net, const struct net_device *dev,
 		       const struct in6_addr *daddr, unsigned int srcprefs,
@@ -187,40 +261,6 @@ nf_nat_masquerade_ipv6(struct sk_buff *skb, const struct nf_nat_range2 *range,
 }
 EXPORT_SYMBOL_GPL(nf_nat_masquerade_ipv6);
 
-struct masq_dev_work {
-	struct work_struct work;
-	struct net *net;
-	struct in6_addr addr;
-	int ifindex;
-};
-
-static int inet6_cmp(struct nf_conn *ct, void *work)
-{
-	struct masq_dev_work *w = (struct masq_dev_work *)work;
-	struct nf_conntrack_tuple *tuple;
-
-	if (!device_cmp(ct, (void *)(long)w->ifindex))
-		return 0;
-
-	tuple = &ct->tuplehash[IP_CT_DIR_REPLY].tuple;
-
-	return ipv6_addr_equal(&w->addr, &tuple->dst.u3.in6);
-}
-
-static void iterate_cleanup_work(struct work_struct *work)
-{
-	struct masq_dev_work *w;
-
-	w = container_of(work, struct masq_dev_work, work);
-
-	nf_ct_iterate_cleanup_net(w->net, inet6_cmp, (void *)w, 0, 0);
-
-	put_net(w->net);
-	kfree(w);
-	atomic_dec(&v6_worker_count);
-	module_put(THIS_MODULE);
-}
-
 /* atomic notifier; can't call nf_ct_iterate_cleanup_net (it can sleep).
  *
  * Defer it to the system workqueue.
@@ -233,36 +273,19 @@ static int masq_inet6_event(struct notifier_block *this,
 {
 	struct inet6_ifaddr *ifa = ptr;
 	const struct net_device *dev;
-	struct masq_dev_work *w;
-	struct net *net;
+	union nf_inet_addr addr;
 
-	if (event != NETDEV_DOWN || atomic_read(&v6_worker_count) >= 16)
+	if (event != NETDEV_DOWN)
 		return NOTIFY_DONE;
 
 	dev = ifa->idev->dev;
-	net = maybe_get_net(dev_net(dev));
-	if (!net)
-		return NOTIFY_DONE;
-
-	if (!try_module_get(THIS_MODULE))
-		goto err_module;
 
-	w = kmalloc(sizeof(*w), GFP_ATOMIC);
-	if (w) {
-		atomic_inc(&v6_worker_count);
+	memset(&addr, 0, sizeof(addr));
 
-		INIT_WORK(&w->work, iterate_cleanup_work);
-		w->ifindex = dev->ifindex;
-		w->net = net;
-		w->addr = ifa->addr;
-		schedule_work(&w->work);
+	addr.in6 = ifa->addr;
 
-		return NOTIFY_DONE;
-	}
-
-	module_put(THIS_MODULE);
- err_module:
-	put_net(net);
+	nf_nat_masq_schedule(dev_net(dev), &addr, dev->ifindex, inet_cmp,
+				 GFP_ATOMIC);
 	return NOTIFY_DONE;
 }
 

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: Urgent  Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
  2021-09-14 11:00                             ` Florian Westphal
@ 2021-09-15 14:25                               ` Martin Zaharinov
  2021-09-15 14:37                                 ` Martin Zaharinov
  2021-09-16 20:00                               ` Martin Zaharinov
  1 sibling, 1 reply; 23+ messages in thread
From: Martin Zaharinov @ 2021-09-15 14:25 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Guillaume Nault, netdev

Hey Florian,

I tested in the lab and it looks much better than before.

See this perf output:

 PerfTop:    6551 irqs/sec  kernel:77.8%  exact: 100.0% lost: 0/0 drop: 0/0 [4000Hz cycles],  (all, 12 CPUs)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    15.70%  [ixgbe]           [k] ixgbe_read_reg
    13.33%  [kernel]          [k] mutex_spin_on_owner
     7.65%  [kernel]          [k] osq_lock
     2.85%  libfrr.so.0.0.0   [.] 0x00000000000ce970
     1.94%  libfrr.so.0.0.0   [.] 0x00000000000ce94e
     1.19%  libc.so.6         [.] 0x0000000000186afa
     1.15%  [kernel]          [k] do_poll.constprop.0
     0.99%  [kernel]          [k] inet_dump_ifaddr
     0.94%  libteam.so.5.6.1  [.] 0x0000000000006470
     0.79%  libc.so.6         [.] 0x0000000000186e57
     0.71%  [ixgbe]           [k] ixgbe_update_mc_addr_list_generic
     0.65%  [kernel]          [k] __fget_files
     0.61%  [kernel]          [k] sock_poll
     0.57%  libteam.so.5.6.1  [.] 0x0000000000009e7d
     0.51%  perf              [.] 0x00000000000bc7b3
     0.51%  libteam.so.5.6.1  [.] 0x0000000000006501
     0.48%  [kernel]          [k] next_uptodate_page
     0.46%  [kernel]          [k] _raw_read_lock_bh
     0.43%  libc.so.6         [.] 0x0000000000186eac
     0.42%  bgpd              [.] 0x0000000000070a46
     0.41%  [pppoe]           [k] pppoe_flush_dev
     0.39%  [kernel]          [k] zap_pte_range


This was captured while removing and adding interfaces as users were dropping and reconnecting.


Now the "ip a" command works fine!


Martin


> On 14 Sep 2021, at 14:00, Florian Westphal <fw@strlen.de> wrote:
> 
> Martin Zaharinov <micron10@gmail.com> wrote:
> 
> [ Trimming CC list ]
> 
>> Florian: 
>> 
>> If you make patch send to test please.
> 
> Attached.  No idea if it helps, but 'ip' should stay responsive
> even when masquerade processes netdevice events.
> <defer_masq_work.diff>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Urgent  Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
  2021-09-15 14:25                               ` Martin Zaharinov
@ 2021-09-15 14:37                                 ` Martin Zaharinov
  0 siblings, 0 replies; 23+ messages in thread
From: Martin Zaharinov @ 2021-09-15 14:37 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Guillaume Nault, netdev

And this one:

  PerfTop:   26378 irqs/sec  kernel:61.4%  exact: 100.0% lost: 0/0 drop: 0/0 [4000Hz cycles],  (all, 12 CPUs)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

     5.65%  libfrr.so.0.0.0   [.] 0x00000000000ce970
     5.56%  [kernel]          [k] osq_lock
     5.22%  [kernel]          [k] mutex_spin_on_owner
     3.66%  [pppoe]           [k] pppoe_flush_dev
     3.01%  libfrr.so.0.0.0   [.] 0x00000000000ce94e
     1.98%  libc.so.6         [.] 0x00000000000965a2
     1.84%  libc.so.6         [.] 0x0000000000186afa
     1.55%  libc.so.6         [.] 0x0000000000186e57
     1.54%  zebra             [.] 0x00000000000b9271
     1.46%  zebra             [.] 0x00000000000b91f1
     1.46%  libteam.so.5.6.1  [.] 0x0000000000006470
     1.44%  libc.so.6         [.] 0x00000000000965a0
     1.30%  libteam.so.5.6.1  [.] 0x0000000000009e7d
     1.08%  [kernel]          [k] fib_table_flush
     1.02%  libc.so.6         [.] 0x0000000000186eac
     0.93%  [kernel]          [k] do_poll.constprop.0
     0.85%  libc.so.6         [.] 0x0000000000186afe
     0.80%  dtvbras           [.] 0x0000000000014be8
     0.78%  [kernel]          [k] queued_read_lock_slowpath
     0.72%  [kernel]          [k] next_uptodate_page
     0.64%  [kernel]          [k] zap_pte_range
     0.64%  bgpd              [.] 0x0000000000070a46
     0.61%  [kernel]          [k] fib_table_insert

> On 15 Sep 2021, at 17:25, Martin Zaharinov <micron10@gmail.com> wrote:
> 
> Hey Florian
> 
> make test in lab and look much better that before.
> 
> see this perf 
> 
> PerfTop:    6551 irqs/sec  kernel:77.8%  exact: 100.0% lost: 0/0 drop: 0/0 [4000Hz cycles],  (all, 12 CPUs)
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 
>    15.70%  [ixgbe]           [k] ixgbe_read_reg
>    13.33%  [kernel]          [k] mutex_spin_on_owner
>     7.65%  [kernel]          [k] osq_lock
>     2.85%  libfrr.so.0.0.0   [.] 0x00000000000ce970
>     1.94%  libfrr.so.0.0.0   [.] 0x00000000000ce94e
>     1.19%  libc.so.6         [.] 0x0000000000186afa
>     1.15%  [kernel]          [k] do_poll.constprop.0
>     0.99%  [kernel]          [k] inet_dump_ifaddr
>     0.94%  libteam.so.5.6.1  [.] 0x0000000000006470
>     0.79%  libc.so.6         [.] 0x0000000000186e57
>     0.71%  [ixgbe]           [k] ixgbe_update_mc_addr_list_generic
>     0.65%  [kernel]          [k] __fget_files
>     0.61%  [kernel]          [k] sock_poll
>     0.57%  libteam.so.5.6.1  [.] 0x0000000000009e7d
>     0.51%  perf              [.] 0x00000000000bc7b3
>     0.51%  libteam.so.5.6.1  [.] 0x0000000000006501
>     0.48%  [kernel]          [k] next_uptodate_page
>     0.46%  [kernel]          [k] _raw_read_lock_bh
>     0.43%  libc.so.6         [.] 0x0000000000186eac
>     0.42%  bgpd              [.] 0x0000000000070a46
>     0.41%  [pppoe]           [k] pppoe_flush_dev
>     0.39%  [kernel]          [k] zap_pte_range
> 
> 
> This happened when remove and add new interface on time of drop and reconnect users.
> 
> 
> now : ip a command work fine !
> 
> 
> Martin
> 
> 
>> On 14 Sep 2021, at 14:00, Florian Westphal <fw@strlen.de> wrote:
>> 
>> Martin Zaharinov <micron10@gmail.com> wrote:
>> 
>> [ Trimming CC list ]
>> 
>>> Florian: 
>>> 
>>> If you make patch send to test please.
>> 
>> Attached.  No idea if it helps, but 'ip' should stay responsive
>> even when masquerade processes netdevice events.
>> <defer_masq_work.diff>
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Urgent  Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected
  2021-09-14 11:00                             ` Florian Westphal
  2021-09-15 14:25                               ` Martin Zaharinov
@ 2021-09-16 20:00                               ` Martin Zaharinov
  1 sibling, 0 replies; 23+ messages in thread
From: Martin Zaharinov @ 2021-09-16 20:00 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Guillaume Nault, netdev

Small update:

After switching the BGP daemon from FRR to BIRD, the FRR load is gone,
but when 5k+ users disconnect, pppoe_flush_dev is still slow.




   PerfTop:   15606 irqs/sec  kernel:77.7%  exact: 100.0% lost: 0/0 drop: 0/0 [4000Hz cycles],  (all, 12 CPUs)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

     8.24%  [kernel]              [k] osq_lock
     7.55%  [kernel]              [k] mutex_spin_on_owner
     7.04%  [pppoe]               [k] pppoe_flush_dev
     2.77%  libteam.so.5.6.1      [.] 0x0000000000009e7d
     2.67%  libteam.so.5.6.1      [.] 0x0000000000006470
     1.90%  [kernel]              [k] fib_table_flush
     1.73%  [kernel]              [k] queued_read_lock_slowpath
     1.68%  [kernel]              [k] next_uptodate_page
     1.36%  ip                    [.] 0x0000000000011b74
     1.23%  ip                    [.] 0x00000000000121b0
     1.09%  [kernel]              [k] zap_pte_range
     0.99%  libteam.so.5.6.1      [.] 0x0000000000006501
     0.88%  dtvbras               [.] 0x0000000000014be8
     0.87%  [kernel]              [k] inet_dump_ifaddr
     0.74%  [kernel]              [k] filemap_map_pages
     0.72%  [kernel]              [k] neigh_flush_dev.isra.0
     0.66%  [kernel]              [k] snmp_get_cpu_field
     0.65%  [kernel]              [k] fib_table_insert
     0.63%  [kernel]              [k] native_irq_return_iret
     0.63%  libteam.so.5.6.1      [.] 0x0000000000005c78
     0.60%  [kernel]              [k] copy_page
     0.52%  libteam.so.5.6.1      [.] 0x000000000000647f
     0.50%  [kernel]              [k] _raw_spin_lock
     0.48%  libc.so.6             [.] 0x00000000000965a2
     0.45%  [kernel]              [k] _raw_read_lock_bh
     0.44%  [kernel]              [k] release_pages
     0.42%  [kernel]              [k] clear_page_erms
     0.42%  [kernel]              [k] page_remove_rmap
     0.41%  [kernel]              [k] queued_spin_lock_slowpath
     0.38%  [kernel]              [k] kmem_cache_alloc
     0.36%  [kernel]              [k] vma_interval_tree_insert
     0.36%  libteam.so.5.6.1      [.] 0x0000000000009e6f
     0.36%  [kernel]              [k] do_set_pte


sessions:
  starting: 296
  active: 3868
  finishing: 6748



> On 14 Sep 2021, at 14:00, Florian Westphal <fw@strlen.de> wrote:
> 
> Martin Zaharinov <micron10@gmail.com> wrote:
> 
> [ Trimming CC list ]
> 
>> Florian: 
>> 
>> If you make patch send to test please.
> 
> Attached.  No idea if it helps, but 'ip' should stay responsive
> even when masquerade processes netdevice events.
> <defer_masq_work.diff>


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2021-09-16 20:00 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-05 20:53 Urgent Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected Martin Zaharinov
2021-08-06  4:40 ` Greg KH
2021-08-06  5:40   ` Martin Zaharinov
2021-08-08 15:14   ` Martin Zaharinov
2021-08-08 15:23     ` Pali Rohár
2021-08-08 15:29       ` Martin Zaharinov
2021-08-09 15:15         ` Pali Rohár
2021-08-10 18:27           ` Martin Zaharinov
2021-08-11 16:40             ` Guillaume Nault
2021-08-11 11:10           ` Martin Zaharinov
2021-08-11 16:48             ` Guillaume Nault
2021-09-07  6:16               ` Martin Zaharinov
2021-09-07  6:42                 ` Martin Zaharinov
2021-09-11  6:26                   ` Martin Zaharinov
2021-09-14  6:16                     ` Martin Zaharinov
2021-09-14  8:02                       ` Guillaume Nault
2021-09-14  9:50                         ` Florian Westphal
2021-09-14 10:01                           ` Martin Zaharinov
2021-09-14 11:00                             ` Florian Westphal
2021-09-15 14:25                               ` Martin Zaharinov
2021-09-15 14:37                                 ` Martin Zaharinov
2021-09-16 20:00                               ` Martin Zaharinov
2021-09-14 10:53                           ` Martin Zaharinov
