b.a.t.m.a.n.lists.open-mesh.org archive mirror
 help / color / mirror / Atom feed
* [B.A.T.M.A.N.] Possible bad interaction between BLA2 and TT?
@ 2013-06-23 21:51 Gui Iribarren
  2013-06-23 22:28 ` Gui Iribarren
  0 siblings, 1 reply; 5+ messages in thread
From: Gui Iribarren @ 2013-06-23 21:51 UTC (permalink / raw)
  To: The list for a Better Approach To Mobile Ad-hoc Networking

[-- Attachment #1: Type: text/plain, Size: 2871 bytes --]

Hey everyone!
fiiinally got back home, a week ago, and got time to debug a strange 
issue here. The report i had from a few users was "intermittent 
connectivity", with "waves" of traffic, with random pauses lasting from 
a few seconds to a minute or so.

I initially dismissed as interference, or even OS problems, but turns 
out they were right! and sadly, batman seems to be in the way

 From what i've seen, watching "batctl tg -w" on every node along the 
way, i could determine the window of time where the traffic gets lost:
from the moment when there's a TT change on one side of the network,
to the moment that change is propagated to the other side.

By ordex's advice, i ran some "batctl ll tt ; batctl l" along the way 
and i'm sending the pastebin results at the end of this mail.

Some (hopefully) useful context follows, and a batctl vd graph is attached

The IPv6 of tdorado is pinged (to rule out DAT interactions) from 
labanda-este (works fine always) and from labanda-oeste (suffers the 
issue, as well as all nodes "behind" it, i.e. casapuente & alfredo).
both labandas are tl-wdr3500 connected by 2.4ghz, 5ghz, and an ethernet 
cable. The ethernet carries only batadv packets (eth0.1 is added to 
bat0); there's no "lan backbone" (the eth0.2 that appears under br-lan 
is not connected to anything)

root@tdorado:~# opkg list kmod-batman-adv # same in all nodes
kmod-batman-adv - 3.8.3+2013.2.0-2
root@tdorado:~# ip a s br-lan
6: br-lan: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue 
state UP
     link/ether 64:70:02:3d:a0:f7 brd ff:ff:ff:ff:ff:ff
     inet6 2a00:1508:1:f004:6670:2ff:fe3d:a0f7/64 scope global dynamic
        valid_lft 6985sec preferred_lft 6985sec
root@tdorado:~# batctl tl -n |grep f7
  * 64:70:02:3d:a0:f7 [.....]   1.140
root@tdorado:~# batctl o |head -n 1
[B.A.T.M.A.N. adv 2013.2.0, MainIF/MAC: wlan0-1/66:70:02:3d:a0:f9 (bat0)]

labanda-este
http://pastebin.com/R1kHQCQG

labanda-oeste
http://pastebin.com/b1Uc23VZ

Both ping6s were started at the same time, so the seq numbers are 
synchronized, and can be used as timestamps.

the "gap" in labanda-oeste is between seq=73 and seq=89
in labanda-oeste there were no messages or traffic for 25secs, and then 
the "TT inconsistency" came up, resolved, and seq=89 succeded, traffic 
restored.
at that time, seq=74, labanda-este got a TT update:
[  23161800] Deleting tdorado from global tt entry 44:d8:84:b0:d2:f5: tt 
removed by changes
and (AFAIU) dropped traffic coming from labanda-oeste until 
labanda-oeste finally got the TT update and increased the ttvn to 129

does any of this make sense? I imagine a tcpdump would help, so i'll try 
getting one, but maybe this debug was enough to get an insight?

As you can imagine, any further pointer will be greatly appreciated,

I hope you're having a great week, ...and that i'm not ruining it as 
always :D

Gui

[-- Attachment #2: dl.svg --]
[-- Type: image/svg+xml, Size: 15801 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [B.A.T.M.A.N.] Possible bad interaction between BLA2 and TT?
  2013-06-23 21:51 [B.A.T.M.A.N.] Possible bad interaction between BLA2 and TT? Gui Iribarren
@ 2013-06-23 22:28 ` Gui Iribarren
  2013-06-24  7:02   ` Antonio Quartulli
  0 siblings, 1 reply; 5+ messages in thread
From: Gui Iribarren @ 2013-06-23 22:28 UTC (permalink / raw)
  To: The list for a Better Approach To Mobile Ad-hoc Networking

On 06/23/2013 06:51 PM, Gui Iribarren wrote:
> Hey everyone!
> fiiinally got back home, a week ago, and got time to debug a strange
> issue here. The report i had from a few users was "intermittent
> connectivity", with "waves" of traffic, with random pauses lasting from
> a few seconds to a minute or so.
>
> I initially dismissed as interference, or even OS problems, but turns
> out they were right! and sadly, batman seems to be in the way
>
>  From what i've seen, watching "batctl tg -w" on every node along the
> way, i could determine the window of time where the traffic gets lost:
> from the moment when there's a TT change on one side of the network,
> to the moment that change is propagated to the other side.
>
> By ordex's advice, i ran some "batctl ll tt ; batctl l" along the way
> and i'm sending the pastebin results at the end of this mail.
>
> Some (hopefully) useful context follows, and a batctl vd graph is attached
>
> The IPv6 of tdorado is pinged (to rule out DAT interactions) from
> labanda-este (works fine always) and from labanda-oeste (suffers the
> issue, as well as all nodes "behind" it, i.e. casapuente & alfredo).
> both labandas are tl-wdr3500 connected by 2.4ghz, 5ghz, and an ethernet
> cable. The ethernet carries only batadv packets (eth0.1 is added to
> bat0); there's no "lan backbone" (the eth0.2 that appears under br-lan
> is not connected to anything)
>
> root@tdorado:~# opkg list kmod-batman-adv # same in all nodes
> kmod-batman-adv - 3.8.3+2013.2.0-2
> root@tdorado:~# ip a s br-lan
> 6: br-lan: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue
> state UP
>      link/ether 64:70:02:3d:a0:f7 brd ff:ff:ff:ff:ff:ff
>      inet6 2a00:1508:1:f004:6670:2ff:fe3d:a0f7/64 scope global dynamic
>         valid_lft 6985sec preferred_lft 6985sec
> root@tdorado:~# batctl tl -n |grep f7
>   * 64:70:02:3d:a0:f7 [.....]   1.140
> root@tdorado:~# batctl o |head -n 1
> [B.A.T.M.A.N. adv 2013.2.0, MainIF/MAC: wlan0-1/66:70:02:3d:a0:f9 (bat0)]
>
> labanda-este
> http://pastebin.com/R1kHQCQG
>
> labanda-oeste
> http://pastebin.com/b1Uc23VZ
>
> Both ping6s were started at the same time, so the seq numbers are
> synchronized, and can be used as timestamps.
>
> the "gap" in labanda-oeste is between seq=73 and seq=89
> in labanda-oeste there were no messages or traffic for 25secs, and then
> the "TT inconsistency" came up, resolved, and seq=89 succeded, traffic
> restored.
> at that time, seq=74, labanda-este got a TT update:
> [  23161800] Deleting tdorado from global tt entry 44:d8:84:b0:d2:f5: tt
> removed by changes
> and (AFAIU) dropped traffic coming from labanda-oeste until
> labanda-oeste finally got the TT update and increased the ttvn to 129

To clarify, what i mean is that during the time window that packets from 
labanda-oeste get lost, i see this

root@labanda-este:~# batctl tg |grep tdorado_eth0
  *      tdorado_eth0  (  0) via           tdorado     ( 77)   (0xe574) 
[...]

root@labanda-oeste:~# batctl tg|grep tdorado_eth0
  *      tdorado_eth0  ( 75) via           tdorado     ( 76)   (0x03db) 
[...]

and traffic is restored the instant oeste gets ttvn=77 / crc=0xe574

finally, i must add that running a "batctl ping tdorado" from 
labanda-oeste in parallel shows no loss during the whole scenario.

>
> does any of this make sense? I imagine a tcpdump would help, so i'll try
> getting one, but maybe this debug was enough to get an insight?
>
> As you can imagine, any further pointer will be greatly appreciated,
>
> I hope you're having a great week, ...and that i'm not ruining it as
> always :D
>
> Gui

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [B.A.T.M.A.N.] Possible bad interaction between BLA2 and TT?
  2013-06-23 22:28 ` Gui Iribarren
@ 2013-06-24  7:02   ` Antonio Quartulli
  2013-06-25  4:14     ` Gui Iribarren
  0 siblings, 1 reply; 5+ messages in thread
From: Antonio Quartulli @ 2013-06-24  7:02 UTC (permalink / raw)
  To: The list for a Better Approach To Mobile Ad-hoc Networking

[-- Attachment #1: Type: text/plain, Size: 1083 bytes --]

On Sun, Jun 23, 2013 at 07:28:08PM -0300, Gui Iribarren wrote:
> On 06/23/2013 06:51 PM, Gui Iribarren wrote:
> >
> > the "gap" in labanda-oeste is between seq=73 and seq=89
> > in labanda-oeste there were no messages or traffic for 25secs, and then
> > the "TT inconsistency" came up, resolved, and seq=89 succeded, traffic
> > restored.
> > at that time, seq=74, labanda-este got a TT update:
> > [  23161800] Deleting tdorado from global tt entry 44:d8:84:b0:d2:f5: tt
> > removed by changes
> > and (AFAIU) dropped traffic coming from labanda-oeste until
> > labanda-oeste finally got the TT update and increased the ttvn to 129

As we clarified on IRC, BLA2 is not involved at all.
From what I can see I'd say that a node in the middle is dropping the traffic
due to a wrong re-routing operation. if you could reproduce the issue, it would
be nice if you could get the tt log on all the nodes along the path.
In this way we can clearly see who is doing what.

Cheers,

-- 
Antonio Quartulli

..each of us alone is worth nothing..
Ernesto "Che" Guevara

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [B.A.T.M.A.N.] Possible bad interaction between BLA2 and TT?
  2013-06-24  7:02   ` Antonio Quartulli
@ 2013-06-25  4:14     ` Gui Iribarren
  2013-06-25  6:38       ` Antonio Quartulli
  0 siblings, 1 reply; 5+ messages in thread
From: Gui Iribarren @ 2013-06-25  4:14 UTC (permalink / raw)
  To: The list for a Better Approach To Mobile Ad-hoc Networking
  Cc: Antonio Quartulli

On 06/24/2013 04:02 AM, Antonio Quartulli wrote:
> On Sun, Jun 23, 2013 at 07:28:08PM -0300, Gui Iribarren wrote:
>> On 06/23/2013 06:51 PM, Gui Iribarren wrote:
>>>
>>> the "gap" in labanda-oeste is between seq=73 and seq=89
>>> in labanda-oeste there were no messages or traffic for 25secs, and then
>>> the "TT inconsistency" came up, resolved, and seq=89 succeded, traffic
>>> restored.
>>> at that time, seq=74, labanda-este got a TT update:
>>> [  23161800] Deleting tdorado from global tt entry 44:d8:84:b0:d2:f5: tt
>>> removed by changes
>>> and (AFAIU) dropped traffic coming from labanda-oeste until
>>> labanda-oeste finally got the TT update and increased the ttvn to 129
>
> As we clarified on IRC, BLA2 is not involved at all.
>  From what I can see I'd say that a node in the middle is dropping the traffic
> due to a wrong re-routing operation. if you could reproduce the issue, it would
> be nice if you could get the tt log on all the nodes along the path.
> In this way we can clearly see who is doing what.

(11:58:09 AM) gui_: ah, ordex, btw, there's no "in the middle" :(
(11:58:29 AM) ordex: what do you mean?
(11:58:41 AM) ordex: does it happen with two nodes only?
(11:58:44 AM) gui_: labanda-este is a direct neighbor of tdorado
(11:58:49 AM) gui_: i mean the path is
(11:58:53 AM) gui_: oeste -> este -> tdorado

And, finally, to throw some spice into the mix, i downgraded 
kmod-batman-adv just in labanda-este (my key suspect) twice, and...
i can reproduce this issue with 3.8.3+2013.2.0-2 and 3.8.3+2013.1.0-3
but not with 3.8.3+2012.4.0-1

both labanda-oeste and tdorado are still running 2013.2.0. Downgrading 
labanda-este to 2012.4.0 was enough to solve the issue.

so it looks like a regression introduced between 2012.4.0 and 2013.1.0

yet, it is pretty obscure to me what is actually happening, (and why 
only in this scenario) so i'll try to get some dumps as promised, but 
since my current internet connection depends on this link, i was a bit 
in a hurry to get it back to stable :P

Cheers!

Gui

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [B.A.T.M.A.N.] Possible bad interaction between BLA2 and TT?
  2013-06-25  4:14     ` Gui Iribarren
@ 2013-06-25  6:38       ` Antonio Quartulli
  0 siblings, 0 replies; 5+ messages in thread
From: Antonio Quartulli @ 2013-06-25  6:38 UTC (permalink / raw)
  To: The list for a Better Approach To Mobile Ad-hoc Networking

[-- Attachment #1: Type: text/plain, Size: 2162 bytes --]

On Tue, Jun 25, 2013 at 01:14:09AM -0300, Gui Iribarren wrote:
> On 06/24/2013 04:02 AM, Antonio Quartulli wrote:
> > On Sun, Jun 23, 2013 at 07:28:08PM -0300, Gui Iribarren wrote:
> >> On 06/23/2013 06:51 PM, Gui Iribarren wrote:
> >>>
> >>> the "gap" in labanda-oeste is between seq=73 and seq=89
> >>> in labanda-oeste there were no messages or traffic for 25secs, and then
> >>> the "TT inconsistency" came up, resolved, and seq=89 succeded, traffic
> >>> restored.
> >>> at that time, seq=74, labanda-este got a TT update:
> >>> [  23161800] Deleting tdorado from global tt entry 44:d8:84:b0:d2:f5: tt
> >>> removed by changes
> >>> and (AFAIU) dropped traffic coming from labanda-oeste until
> >>> labanda-oeste finally got the TT update and increased the ttvn to 129
> >
> > As we clarified on IRC, BLA2 is not involved at all.
> >  From what I can see I'd say that a node in the middle is dropping the traffic
> > due to a wrong re-routing operation. if you could reproduce the issue, it would
> > be nice if you could get the tt log on all the nodes along the path.
> > In this way we can clearly see who is doing what.
> 
> (11:58:09 AM) gui_: ah, ordex, btw, there's no "in the middle" :(
> (11:58:29 AM) ordex: what do you mean?
> (11:58:41 AM) ordex: does it happen with two nodes only?
> (11:58:44 AM) gui_: labanda-este is a direct neighbor of tdorado
> (11:58:49 AM) gui_: i mean the path is
> (11:58:53 AM) gui_: oeste -> este -> tdorado
> 
> And, finally, to throw some spice into the mix, i downgraded 
> kmod-batman-adv just in labanda-este (my key suspect) twice, and...
> i can reproduce this issue with 3.8.3+2013.2.0-2 and 3.8.3+2013.1.0-3
> but not with 3.8.3+2012.4.0-1
> 
> both labanda-oeste and tdorado are still running 2013.2.0. Downgrading 
> labanda-este to 2012.4.0 was enough to solve the issue.
> 
> so it looks like a regression introduced between 2012.4.0 and 2013.1.0

why didn't you 2013.0.0 since you were jumping from version to version? :)

Thanks a lot for doing these tests!

Cheers,

-- 
Antonio Quartulli

..each of us alone is worth nothing..
Ernesto "Che" Guevara

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2013-06-25  6:38 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-06-23 21:51 [B.A.T.M.A.N.] Possible bad interaction between BLA2 and TT? Gui Iribarren
2013-06-23 22:28 ` Gui Iribarren
2013-06-24  7:02   ` Antonio Quartulli
2013-06-25  4:14     ` Gui Iribarren
2013-06-25  6:38       ` Antonio Quartulli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).