* Re: can't find logged information
       [not found] <24aa48bb-8f52-78be-28df-98ec2e2a054b@coolheads.com>
@ 2020-02-07 15:18 ` Steve Newcomb
  2020-02-07 15:52   ` Simon Wunderlich
  0 siblings, 1 reply; 9+ messages in thread
From: Steve Newcomb @ 2020-02-07 15:18 UTC
  To: b.a.t.m.a.n


On 2/7/20 9:51 AM, Simon Wunderlich wrote:
> On Friday, February 7, 2020 3:13:47 PM CET Steve Newcomb wrote:
>> @rpc152:/tmp/log# echo "$(logread)" | grep batman
>> Thu Feb 6 15:21:13 2020 kern.warn kernel: [174193.938445] batman_adv:
>> [Deprecated]: batctl (pid 22747) Use of debugfs file "nc_nodes".
>> @rpc152:/tmp/log#
>>
>>
>> What have I missed?
> Hi Steve,
>
> you can use "batctl log" to retrieve the log. It will not appear in your
> logread.
Alas, that doesn't work either, and I don't know why:

root@rpc152:~# batctl log
Error - no valid command or debug table specified: log
Usage: batctl [options] command|debug table [parameters]
options:
         -h print this help (or 'batctl <command|debug table> -h' for the parameter help)
         -v print version

commands:
         meshif <netdev> aggregation|ag [0|1] display or modify aggregation setting
...
>
> When the problem happens, you can also check "iw wlan0 station dump" and other
> debug files (batctl n for neighbors) to find out if the WiFi layer is still
> working. It wouldn't be the first time that actually the WiFi chip or driver
> has a problem, not batman-adv.
I've seen that "batctl n" works, and "iw mesh0 station dump" works, too.

I am arranging for the nodes to send me such mail when things have gone 
awry, but prior to rebooting.  I've written a tiny mail queueing system 
that optionally uses nonvolatile memory for the queue.



* Re: can't find logged information
  2020-02-07 15:18 ` can't find logged information Steve Newcomb
@ 2020-02-07 15:52   ` Simon Wunderlich
  2020-02-07 16:59     ` Steve Newcomb
  0 siblings, 1 reply; 9+ messages in thread
From: Simon Wunderlich @ 2020-02-07 15:52 UTC
  To: b.a.t.m.a.n, srn

On Friday, February 7, 2020 4:18:19 PM CET Steve Newcomb wrote:
> On 2/7/20 9:51 AM, Simon Wunderlich wrote:
> > On Friday, February 7, 2020 3:13:47 PM CET Steve Newcomb wrote:
> >> @rpc152:/tmp/log# echo "$(logread)" | grep batman
> >> Thu Feb 6 15:21:13 2020 kern.warn kernel: [174193.938445] batman_adv:
> >> [Deprecated]: batctl (pid 22747) Use of debugfs file "nc_nodes".
> >> @rpc152:/tmp/log#
> >> 
> >> 
> >> What have I missed?
> > 
> > Hi Steve,
> > 
> > you can use "batctl log" to retrieve the log. It will not appear in your
> > logread.
> 
> Alas, that doesn't work either, and I don't know why:
> 
> root@rpc152:~# batctl log
> Error - no valid command or debug table specified: log
> Usage: batctl [options] command|debug table [parameters]
> options:
>          -h print this help (or 'batctl <command|debug table> -h' for
> the parameter help)
>          -v print version
> 
> commands:
>          meshif <netdev> aggregation|ag [0|1] display or modify
> aggregation setting
> ...
> 

Oops, you are right, we have actually removed that command in 2019.2. You can 
use one of the two following commands:

cat /sys/kernel/debug/batman_adv/bat0/log

(will be removed in the future when debugfs support is dropped)

 trace-cmd stream -e batadv:batadv_dbg
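
A minimal sketch of choosing between the two commands above, assuming the
mesh interface is bat0 as in this thread. The helper name pick_log_command
and its injectable exists parameter are made up for illustration; only the
debugfs path and the trace-cmd invocation come from the message.

```python
import os

# Both values are copied from the message; bat0 is the interface used in this thread.
DEBUGFS_LOG = "/sys/kernel/debug/batman_adv/bat0/log"
TRACE_CMD = ["trace-cmd", "stream", "-e", "batadv:batadv_dbg"]


def pick_log_command(exists=os.path.exists):
    """Return the argv to run for reading the batman-adv debug log.

    Uses the debugfs file while it is still present and falls back to
    trace-cmd otherwise. `exists` is injectable (hypothetical convenience)
    so the choice can be checked without a real batman-adv node.
    """
    if exists(DEBUGFS_LOG):
        return ["cat", DEBUGFS_LOG]
    return list(TRACE_CMD)
```

The returned list could be handed to subprocess.Popen on a machine that has
Python available; on the node itself the same test is a one-line shell check.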

> > When the problem happens, you can also check "iw wlan0 station dump"
> > and other
> > debug files (batctl n for neighbors) to find out if the WiFi layer is
> > still
> > working. It wouldn't be the first time that actually the WiFi chip or
> > driver
> > has a problem, not batman-adv.
> 
> I've seen that "batctl n" works, and "iw mesh0 station dump" works, too.
> 
> I am arranging for the nodes to send me such mail when things have gone
> awry, but prior to rebooting.  I've written a tiny mail queueing system
> that optionally uses nonvolatile memory for the queue.

By "works", do you mean you get useful output where the timeout is not
increasing, or similar? Can you still "batctl ping" one of your neighbors?

Cheers,
      Simon


* Re: can't find logged information
  2020-02-07 15:52   ` Simon Wunderlich
@ 2020-02-07 16:59     ` Steve Newcomb
  2020-02-07 17:07       ` Simon Wunderlich
  0 siblings, 1 reply; 9+ messages in thread
From: Steve Newcomb @ 2020-02-07 16:59 UTC
  To: Simon Wunderlich, b.a.t.m.a.n


On 2/7/20 10:52 AM, Simon Wunderlich wrote:
>
> Oops, you are right, we have actually removed that command in 2019.2. You can
> use one of the two following commands:
>
> cat /sys/kernel/debug/batman_adv/bat0/log
The cat method works.  Thanks.  Wow, with all the debug info selected, 
it's a torrent of stuff.  (Which I expected.)

>
> (will be removed in the future when debugfs support is dropped)
>
>   trace-cmd stream -e batadv:batadv_dbg
Uh, *which* of the two above commands will be dropped?  Should I add 
trace-cmd to my menuconfig to retain access to the logs?  I haven't 
added it yet.

> By "works" you mean you get useful outputs where the timeout is not
> increasing or similar? can you still "batctl ping" to one of your neighbors?

You are far ahead of where I am.  I meant only that I was getting what 
appears to be useful information, not that I have interpreted it.  It 
will take me a while to figure out how to read it.  I would be grateful 
for any hints on that, and I'm already very grateful for the help you 
have provided.

I haven't been using "batctl ping" to cue a reboot; I've been using 
"ping".  I'm not sure what difference it might make to use "batctl 
ping"; it seems to me that if I can't "ping" a node, after 3 or 4 trials 
over a period of 15-20 seconds, I'm no longer in touch with it in any 
usable way.  When that happens, as long as I don't reboot the gateway, 
everything just stays offline.  So it's better to reboot.
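
The reboot rule described above can be sketched as a small decision function.
This is only an illustration: the name should_reboot, the default of 4
failures, and the idea of feeding it a list of recent ping outcomes are
assumptions, not the actual script running on these nodes.

```python
def should_reboot(ping_results, max_failures=4):
    """Decide whether the most recent pings justify a reboot.

    ping_results: sequence of booleans, oldest first (True = reply received).
    Returns True only when the last `max_failures` attempts all failed.
    Spacing the attempts about 5 seconds apart would cover the 15-20 second
    window mentioned in the message (assumed interval).
    """
    recent = list(ping_results)[-max_failures:]
    return len(recent) == max_failures and not any(recent)
```

Looking only at the trailing attempts means one successful reply resets the
decision, so a single lost packet never triggers a reboot.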

By the way, at least one user is using the mesh for landline telephone 
(Google Voice) service, so I'd really like to stabilize it if I can.


* Re: can't find logged information
  2020-02-07 16:59     ` Steve Newcomb
@ 2020-02-07 17:07       ` Simon Wunderlich
  2020-02-11 19:03         ` Steve Newcomb
  0 siblings, 1 reply; 9+ messages in thread
From: Simon Wunderlich @ 2020-02-07 17:07 UTC
  To: srn; +Cc: b.a.t.m.a.n

On Friday, February 7, 2020 5:59:27 PM CET Steve Newcomb wrote:
> On 2/7/20 10:52 AM, Simon Wunderlich wrote:
> > Oops, you are right, we have actually removed that command in 2019.2. You
> > can use one of the two following commands:
> > 
> > cat /sys/kernel/debug/batman_adv/bat0/log
> 
> The cat method works.  Thanks.  Wow, with all the debug info selected,
> it's a torrent of stuff.  (Which I expected.)
> 

Awesome

> > (will be removed in the future when debugfs support is dropped)
> > 
> >   trace-cmd stream -e batadv:batadv_dbg
> 
> Uh, *which* of the two above commands will be dropped?  Should I add
> trace-cmd to my menuconfig to retain access to the logs?  I haven't
> added it yet.

The /sys/kernel/debug/batman_adv/bat0/log file will be removed in the future.
So yes, add trace-cmd.

> > By "works" you mean you get useful outputs where the timeout is not
> > increasing or similar? can you still "batctl ping" to one of your neighbors?
> 
> You are far ahead of where I am.  I meant only that I was getting what
> appears to be useful information, not that I have interpreted it.  It
> will take me a while to figure out how to read it.  I would be grateful
> for any hints on that, and I'm already very grateful for the help you
> have provided.
> 
> I haven't been using "batctl ping" to cue a reboot; I've been using
> "ping".  I'm not sure what difference it might make to use "batctl
> ping"; it seems to me that if I can't "ping" a node, after 3 or 4 trials
> over a period of 15-20 seconds, I'm no longer in touch with it in any
> usable way.  When that happens, as long as I don't reboot the gateway,
> everything just stays offline.  So it's better to reboot.
> 
> By the way, at least one user is using the mesh for landline telephone
> (Google Voice) service, so I'd really like to stabilize it if I can.

In the iw station dump and the batctl neighbor table, there is a "last seen"
or "inactive time" field. Usually those fields should be < 5 seconds or so; if
they rise, it means this station has been lost somehow.
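
As a rough sketch of that check, the snippet below scans text in the usual
"iw <dev> station dump" layout and reports stations whose inactive time
exceeds a limit. The sample text, the MAC addresses in it, and the function
name stale_stations are invented for illustration, and the exact output
format of iw may differ between versions.

```python
import re

STATION_RE = re.compile(r"^Station\s+([0-9a-f:]{17})\s+\(on\s+(\S+)\)", re.IGNORECASE)
INACTIVE_RE = re.compile(r"^\s*inactive time:\s*(\d+)\s*ms", re.IGNORECASE)


def stale_stations(dump_text, limit_ms=5000):
    """Return [(mac, inactive_ms)] for stations idle longer than limit_ms."""
    stale = []
    current = None
    for line in dump_text.splitlines():
        m = STATION_RE.match(line)
        if m:
            current = m.group(1).lower()  # remember which station block we are in
            continue
        m = INACTIVE_RE.match(line)
        if m and current is not None:
            ms = int(m.group(1))
            if ms > limit_ms:
                stale.append((current, ms))
    return stale


# Illustrative input only; not captured from the nodes in this thread.
SAMPLE = """\
Station 02:00:00:00:00:01 (on mesh0)
\tinactive time:\t40 ms
\trx bytes:\t123456
Station 02:00:00:00:00:02 (on mesh0)
\tinactive time:\t18230 ms
\trx bytes:\t7890
"""
```

The 5000 ms default mirrors the "< 5 seconds or so" rule of thumb; it is a
starting point for alerting, not a hard protocol limit.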

The difference between batctl ping and a regular ping is that batctl ping is
implemented INSIDE batman-adv and addresses a specific mesh participant
directly. It therefore relies on fewer "moving parts" such as correctly
configured IPs, MAC address translation, etc. If batctl ping works but regular
ping doesn't, then we know something on the Ethernet transport part is off. The
other way around (batctl ping doesn't work but regular ping does) is pretty
much impossible unless the user does something wrong. If batctl ping to a
neighbor doesn't work, it's likely that the WiFi layer or something else is
exhibiting problems, or batman-adv is not set up correctly.
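
The four combinations described above can be written down as a small lookup.
This is a sketch for a monitoring script; the function name diagnose and the
wording of the returned hints are assumptions that paraphrase the paragraph.

```python
def diagnose(batctl_ping_ok, ip_ping_ok):
    """Map the two ping results to a troubleshooting hint.

    batctl_ping_ok: result of "batctl ping" to the mesh participant.
    ip_ping_ok:     result of a regular IP ping to the same node.
    """
    if batctl_ping_ok and ip_ping_ok:
        return "ok: mesh layer and IP layer both reachable"
    if batctl_ping_ok:
        # Mesh works, IP does not: look above batman-adv.
        return "ip: check IP configuration, MAC address translation, Ethernet transport"
    if ip_ping_ok:
        # Described in the message as pretty much impossible unless the test is wrong.
        return "unexpected: recheck how the test was run"
    return "mesh: suspect the WiFi layer, the driver, or the batman-adv setup"
```

Running both pings before deciding to reboot would at least record which
layer had failed at the time.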

Cheers,
       Simon


* Re: can't find logged information
  2020-02-07 17:07       ` Simon Wunderlich
@ 2020-02-11 19:03         ` Steve Newcomb
  2020-02-19  9:50           ` Simon Wunderlich
  0 siblings, 1 reply; 9+ messages in thread
From: Steve Newcomb @ 2020-02-11 19:03 UTC
  To: Simon Wunderlich; +Cc: b.a.t.m.a.n

First a report about trace-cmd.  It's not working right, I think, in the 
bleeding-edge ("trunk") distribution of OpenWRT that I'm using.

Then, in case you are inclined to respond, below are some 
observations/questions about the loss-of-contact I'm experiencing in my 
meshes.

---------------------------------------------

  I tried setting 'batctl ll bla'.  (I suspect something to do with 
bridge loop avoidance may be what's killing our meshes.)  After that, 
the cat /sys/kernel/debug/batman_adv/bat0/log method works pretty well, 
but the trace-cmd method produces little or nothing.

root@rpc152:~# trace-cmd stream -e batadv:batadv_dbg
trace-cmd: Invalid argument
   [cfg80211:cfg80211_tx_mlme_mgmt] function __le16_to_cpup not defined
trace-cmd: Invalid argument
   [cfg80211:cfg80211_rx_unprot_mlme_mgmt] function __le16_to_cpup not 
defined
   [cfg80211:cfg80211_rx_mlme_mgmt] function __le16_to_cpup not defined
Hit Ctrl^C to stop recording

[the above lines appear to be coming from stderr]

[... here I wait a long time, and there's no output, so eventually I 
grow impatient and I interrupt ...]

^C^Croot@rpc152:~#

----------------------------------------------

Sometimes after interrupting I get a few lines of bridge-loop-avoidance 
log entries; here is such a case:

@rpc152:/root# trace-cmd stream -e batadv:batadv_dbg
trace-cmd: Invalid argument
   [cfg80211:cfg80211_tx_mlme_mgmt] function __le16_to_cpup not defined
trace-cmd: Invalid argument
   [cfg80211:cfg80211_rx_unprot_mlme_mgmt] function __le16_to_cpup not 
defined
   [cfg80211:cfg80211_rx_mlme_mgmt] function __le16_to_cpup not defined
Hit Ctrl^C to stop recording
   C-c C-c          <idle>-0     [000] 81408.634044: batadv_dbg:           batman_adv bat0 batadv_bla_process_claim(): received a claim frame from another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
            <...>-4595  [000] 81409.423996: batadv_dbg:           batman_adv bat0 batadv_bla_purge_claims(): timed out.
            <...>-4595  [000] 81409.424016: batadv_dbg:           batman_adv bat0 batadv_bla_purge_claims(): 5c:1d:d9:54:09:8f, vid 32769
            <...>-4595  [000] 81409.424034: batadv_dbg:           batman_adv bat0 batadv_bla_send_claim(): UNCLAIM 5c:1d:d9:54:09:8f on vid 1
            <...>-4595  [000] 81409.424053: batadv_dbg:           batman_adv bat0 batadv_handle_unclaim(): UNCLAIM 5c:1d:d9:54:09:8f on vid 1 (sent by 50:c7:bf:69:bb:17)...
            <...>-4595  [000] 81409.424065: batadv_dbg:           batman_adv bat0 batadv_bla_del_claim(): 5c:1d:d9:54:09:8f, vid 1
            <...>-4595  [000] 81409.424092: batadv_dbg:           batman_adv bat0 batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
           <idle>-0     [000] 81418.874031: batadv_dbg:           batman_adv bat0 batadv_bla_process_claim(): received a claim frame from another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
            <...>-4595  [000] 81419.663601: batadv_dbg:           batman_adv bat0 batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
@rpc152:/root#

------------------------------------------------

With the cat /sys/kernel/debug/batman_adv/bat0/log method of 
log-reading, there's generally oodles of output.  At first it comes all 
in a rush, and then it continues more slowly but still pretty steadily:

@rpc152:/root# cat /sys/kernel/debug/batman_adv/bat0/log
[  81068650] batadv_bla_process_claim(): received a claim frame from 
another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src 
43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
[  81069440] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
[  81078890] batadv_bla_process_claim(): received a claim frame from 
another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src 
43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
[  81079680] batadv_bla_purge_claims(): timed out.
[  81079680] batadv_bla_purge_claims(): 64:c7:53:e6:f8:ae, vid 32769
[  81079680] batadv_bla_send_claim(): UNCLAIM 64:c7:53:e6:f8:ae on vid 1
[  81079680] batadv_handle_unclaim(): UNCLAIM 64:c7:53:e6:f8:ae on vid 1 
(sent by 50:c7:bf:69:bb:17)...
[  81079680] batadv_bla_del_claim(): 64:c7:53:e6:f8:ae, vid 1
[  81079680] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
[  81079680] Send loopdetect frame for vid 32769
[  81079680] batadv_bla_send_claim(): LOOPDETECT of ba:be:22:e0:3f:ec to 
ff:ff:ff:ff:ff:ff on vid 1
[  81089130] batadv_bla_process_claim(): received a claim frame from 
another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src 
43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
[  81089920] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
[  81099370] batadv_bla_process_claim(): received a claim frame from 
another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src 
43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
[  81100160] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
[  81109610] batadv_bla_process_claim(): received a claim frame from 
another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src 
43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
[  81110400] batadv_bla_purge_claims(): timed out.
[  81110400] batadv_bla_purge_claims(): 5c:1d:d9:54:09:8f, vid 32769
[  81110400] batadv_bla_send_claim(): UNCLAIM 5c:1d:d9:54:09:8f on vid 1
[  81110400] batadv_handle_unclaim(): UNCLAIM 5c:1d:d9:54:09:8f on vid 1 
(sent by 50:c7:bf:69:bb:17)...
[  81110400] batadv_bla_del_claim(): 5c:1d:d9:54:09:8f, vid 1
[  81110400] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
[  81119850] batadv_bla_process_claim(): received a claim frame from 
another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src 
43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
[  81120640] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
[  81130090] batadv_bla_process_claim(): received a claim frame from 
another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src 
43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
[  81130880] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
[  81140330] batadv_bla_process_claim(): received a claim frame from 
another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src 
43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
[  81141120] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
[  81141120] Send loopdetect frame for vid 32769
[  81141120] batadv_bla_send_claim(): LOOPDETECT of ba:be:e5:06:19:6e to 
ff:ff:ff:ff:ff:ff on vid 1
[  81150100] batadv_bla_rx(): Unclaimed MAC 5c:1d:d9:54:09:8f found. 
Claim it. Local: no
[  81150100] batadv_bla_add_claim(): adding new entry 5c:1d:d9:54:09:8f, 
vid 1 to hash ...
[  81150100] batadv_bla_send_claim(): CLAIM 5c:1d:d9:54:09:8f on vid 1
[  81150570] batadv_bla_process_claim(): received a claim frame from 
another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src 
43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
[  81151360] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
[  81160810] batadv_bla_process_claim(): received a claim frame from 
another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src 
43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
[  81161600] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
[  81171050] batadv_bla_process_claim(): received a claim frame from 
another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src 
43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
[  81171840] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
[  81181290] batadv_bla_process_claim(): received a claim frame from 
another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src 
43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
[  81182080] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
[  81191530] batadv_bla_process_claim(): received a claim frame from 
another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src 
43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
[  81192320] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
[  81201770] batadv_bla_process_claim(): received a claim frame from 
another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src 
43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
[  81202560] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
[  81202560] Send loopdetect frame for vid 32769
[  81202560] batadv_bla_send_claim(): LOOPDETECT of ba:be:bb:25:c7:93 to 
ff:ff:ff:ff:ff:ff on vid 1
[  81212010] batadv_bla_process_claim(): received a claim frame from 
another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src 
43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
[  81212800] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
[  81222250] batadv_bla_process_claim(): received a claim frame from 
another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src 
43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
[  81223040] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
[  81232490] batadv_bla_process_claim(): received a claim frame from 
another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src 
43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
[  81233280] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
[  81242730] batadv_bla_process_claim(): received a claim frame from 
another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src 
43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
[  81243520] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1

-------------------------------------------------------------------------

...so if I were you I wouldn't hurry to remove 
/sys/kernel/debug/batman_adv/bat0/log.  The trace-cmd stuff doesn't 
appear to work quite correctly yet.
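
To make a torrent like the debugfs excerpt above easier to read, a small
off-node script can rejoin the mail-wrapped lines and count which frame types
batadv_bla_send_claim() reports. This is a sketch: the function names are
invented, and the sample lines are copied from the log in this message.

```python
import re
from collections import Counter

RECORD_START = re.compile(r"^\[\s*(\d+)\]\s+(.*)$")            # "[  81079680] ..."
SEND_CLAIM = re.compile(r"batadv_bla_send_claim\(\): (\w+)\b")  # frame type keyword


def join_wrapped(lines):
    """Merge wrapped continuation lines back into one record per log entry."""
    records = []
    for line in lines:
        line = line.rstrip()
        if not line:
            continue
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line.lstrip()
    return records


def count_sent_frames(lines):
    """Count the frame types logged by batadv_bla_send_claim()."""
    counts = Counter()
    for record in join_wrapped(lines):
        m = SEND_CLAIM.search(record)
        if m:
            counts[m.group(1)] += 1
    return counts


SAMPLE = """\
[  81079680] batadv_bla_send_claim(): UNCLAIM 64:c7:53:e6:f8:ae on vid 1
[  81079680] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
[  81079680] batadv_bla_send_claim(): LOOPDETECT of ba:be:22:e0:3f:ec to
ff:ff:ff:ff:ff:ff on vid 1
[  81089130] batadv_bla_process_claim(): received a claim frame from
another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src
43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
[  81089920] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
""".splitlines()
```

Counting per type over a longer capture would show whether CLAIM and UNCLAIM
events cluster around the times when nodes drop out.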

-----------------------------------------------------------------------

Now my first question is: what does it mean to have a

"LOOPDETECT of ba:be:aa:c0:d6:bf to ff:ff:ff:ff:ff:ff"

?

-----------------------------------------------------------------------

When nodes become unreachable, they do so only partially.  Consider this 
weirdness I encountered two days ago:  Given nodes a, b, c, d, from the 
perspective of a, d has disappeared; in other words, "a# batctl ping d" 
doesn't work.  But I ssh'd from a to b, then from b to c, then from c to 
d, all successfully.  And "a# batctl ping d" still wasn't working, even 
though I was talking to d through that chain of ssh pipes.  Any ideas on 
what that might mean?  (When I reboot a -- the gateway -- everything 
always works again, usually for many hours, but never as long as a whole 
day.)

Assuming some mischief is being done in bridge-loop-avoidance-land, and 
bearing in mind that both meshes are physically laid out as simple 
linear daisy-chains, is there a way to force each node to ignore all 
nodes other than its immediate lefthand and righthand nodes?  If that's 
possible, it might allow an interesting experiment.

----------------------------------------------------------------------

Do I have a problem because the two meshes, and everything connected to 
them, all share the same LAN?  I note "received a claim frame from 
another group" in the above log excerpt.  (I don't know what that means, 
but I'm guessing that the two meshes are getting each other's 
maintenance traffic.)  Should the two meshes be separate subnets?
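
One way to look closer at the "another group" lines is to decode the hw_dst
field they print. The sketch below assumes the claim destination layout as I
understand it from the batman-adv packet definitions (three magic bytes
ff:43:05, one type byte, a two-byte group id); treat that layout and the type
table as assumptions to verify against the batman-adv sources. The function
name decode_claim_dst is made up.

```python
# Assumed mapping of the type byte, to be checked against batman-adv's packet header.
CLAIM_TYPES = {
    0x00: "CLAIM",
    0x01: "UNCLAIM",
    0x02: "ANNOUNCE",
    0x03: "REQUEST",
    0x04: "LOOPDETECT",
}


def decode_claim_dst(hw_dst):
    """Split a claim frame hw_dst such as 'ff:43:05:02:c0:74' into its fields."""
    octets = bytes(int(part, 16) for part in hw_dst.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 6-byte MAC address")
    return {
        "magic_ok": octets[:3] == b"\xff\x43\x05",        # assumed magic prefix
        "type": CLAIM_TYPES.get(octets[3], "unknown"),
        "group": int.from_bytes(octets[4:6], "big"),       # group id, big-endian
    }
```

Under these assumptions the hw_dst in the excerpt decodes to an ANNOUNCE
frame for group 0xc074; comparing that value with the group used by the local
mesh would show whether the frames really come from the other mesh.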

In a similar vein: Should each node be running its own subnet?

----------------------------------------------------------------------

Should I try changing all nodes over to BATMAN_V, rebooting them all, 
and hoping they re-establish contact?  (It would be massively 
inconvenient to have to reset them all physically.)

----------------------------------------------------------------------

Should I try turning off bridge loop avoidance?

---------------------------------------------------------------------

Should I try turning on ap_isolation?

---------------------------------------------------------------------

Any other ideas?




* Re: can't find logged information
  2020-02-11 19:03         ` Steve Newcomb
@ 2020-02-19  9:50           ` Simon Wunderlich
  2020-02-19 15:00             ` Steve Newcomb
  0 siblings, 1 reply; 9+ messages in thread
From: Simon Wunderlich @ 2020-02-19  9:50 UTC
  To: srn; +Cc: b.a.t.m.a.n

Hi Steve,

On Tuesday, February 11, 2020 8:03:32 PM CET Steve Newcomb wrote:
> First a report about trace-cmd.  It's not working right, I think, in the
> bleeding-edge ("trunk") distribution of OpenWRT that I'm using.
> 
> Then, in case you are inclined to respond, below are some
> observations/questions about the loss-of-contact I'm experiencing in my
> meshes.
> 
> ---------------------------------------------
> 
>   I tried setting 'batctl ll bla'.  (I suspect something to do with
> bridge loop avoidance may be what's killing our meshes.)  After that,
> the cat /sys/kernel/debug/batman_adv/bat0/log method works pretty well,
> but the trace-cmd method produces little or nothing.
> 
> root@rpc152:~# trace-cmd stream -e batadv:batadv_dbg
> trace-cmd: Invalid argument
>    [cfg80211:cfg80211_tx_mlme_mgmt] function __le16_to_cpup not defined
> trace-cmd: Invalid argument
>    [cfg80211:cfg80211_rx_unprot_mlme_mgmt] function __le16_to_cpup not
> defined
>    [cfg80211:cfg80211_rx_mlme_mgmt] function __le16_to_cpup not defined
> Hit Ctrl^C to stop recording
> 
> [the above lines appear to be coming from stderr]
> 
> [... here I wait a long time, and there's no output, so eventually I
> grow impatient and I interrupt ...]
> 
> ^C^Croot@rpc152:~#
> 
> ----------------------------------------------
> 
> Sometimes after interrupting I get a few lines of bridge-loop-avoidance
> log entries; here is such a case:
> 
> @rpc152:/root# trace-cmd stream -e batadv:batadv_dbg
> trace-cmd: Invalid argument
>    [cfg80211:cfg80211_tx_mlme_mgmt] function __le16_to_cpup not defined
> trace-cmd: Invalid argument
>    [cfg80211:cfg80211_rx_unprot_mlme_mgmt] function __le16_to_cpup not
> defined
>    [cfg80211:cfg80211_rx_mlme_mgmt] function __le16_to_cpup not defined
> Hit Ctrl^C to stop recording
>    C-c C-c          <idle>-0     [000] 81408.634044: batadv_dbg:           batman_adv bat0 batadv_bla_process_claim(): received a claim frame from another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
>             <...>-4595  [000] 81409.423996: batadv_dbg:           batman_adv bat0 batadv_bla_purge_claims(): timed out.
>             <...>-4595  [000] 81409.424016: batadv_dbg:           batman_adv bat0 batadv_bla_purge_claims(): 5c:1d:d9:54:09:8f, vid 32769
>             <...>-4595  [000] 81409.424034: batadv_dbg:           batman_adv bat0 batadv_bla_send_claim(): UNCLAIM 5c:1d:d9:54:09:8f on vid 1
>             <...>-4595  [000] 81409.424053: batadv_dbg:           batman_adv bat0 batadv_handle_unclaim(): UNCLAIM 5c:1d:d9:54:09:8f on vid 1 (sent by 50:c7:bf:69:bb:17)...
>             <...>-4595  [000] 81409.424065: batadv_dbg:           batman_adv bat0 batadv_bla_del_claim(): 5c:1d:d9:54:09:8f, vid 1
>             <...>-4595  [000] 81409.424092: batadv_dbg:           batman_adv bat0 batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
>            <idle>-0     [000] 81418.874031: batadv_dbg:           batman_adv bat0 batadv_bla_process_claim(): received a claim frame from another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
>             <...>-4595  [000] 81419.663601: batadv_dbg:           batman_adv bat0 batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
> @rpc152:/root#
> 
> ------------------------------------------------
> 
> With the cat /sys/kernel/debug/batman_adv/bat0/log method of
> log-reading, there's generally oodles of output.  At first it comes all
> in a rush, and then it continues more slowly but still pretty steadily:
> 
> @rpc152:/root# cat /sys/kernel/debug/batman_adv/bat0/log
> [  81068650] batadv_bla_process_claim(): received a claim frame from
> another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src
> 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
> [  81069440] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
> [  81078890] batadv_bla_process_claim(): received a claim frame from
> another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src
> 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
> [  81079680] batadv_bla_purge_claims(): timed out.
> [  81079680] batadv_bla_purge_claims(): 64:c7:53:e6:f8:ae, vid 32769
> [  81079680] batadv_bla_send_claim(): UNCLAIM 64:c7:53:e6:f8:ae on vid 1
> [  81079680] batadv_handle_unclaim(): UNCLAIM 64:c7:53:e6:f8:ae on vid 1
> (sent by 50:c7:bf:69:bb:17)...
> [  81079680] batadv_bla_del_claim(): 64:c7:53:e6:f8:ae, vid 1
> [  81079680] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
> [  81079680] Send loopdetect frame for vid 32769
> [  81079680] batadv_bla_send_claim(): LOOPDETECT of ba:be:22:e0:3f:ec to
> ff:ff:ff:ff:ff:ff on vid 1
> [  81089130] batadv_bla_process_claim(): received a claim frame from
> another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src
> 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
> [  81089920] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
> [  81099370] batadv_bla_process_claim(): received a claim frame from
> another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src
> 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
> [  81100160] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
> [  81109610] batadv_bla_process_claim(): received a claim frame from
> another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src
> 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
> [  81110400] batadv_bla_purge_claims(): timed out.
> [  81110400] batadv_bla_purge_claims(): 5c:1d:d9:54:09:8f, vid 32769
> [  81110400] batadv_bla_send_claim(): UNCLAIM 5c:1d:d9:54:09:8f on vid 1
> [  81110400] batadv_handle_unclaim(): UNCLAIM 5c:1d:d9:54:09:8f on vid 1
> (sent by 50:c7:bf:69:bb:17)...
> [  81110400] batadv_bla_del_claim(): 5c:1d:d9:54:09:8f, vid 1
> [  81110400] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
> [  81119850] batadv_bla_process_claim(): received a claim frame from
> another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src
> 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
> [  81120640] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
> [  81130090] batadv_bla_process_claim(): received a claim frame from
> another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src
> 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
> [  81130880] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
> [  81140330] batadv_bla_process_claim(): received a claim frame from
> another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src
> 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
> [  81141120] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
> [  81141120] Send loopdetect frame for vid 32769
> [  81141120] batadv_bla_send_claim(): LOOPDETECT of ba:be:e5:06:19:6e to
> ff:ff:ff:ff:ff:ff on vid 1
> [  81150100] batadv_bla_rx(): Unclaimed MAC 5c:1d:d9:54:09:8f found.
> Claim it. Local: no
> [  81150100] batadv_bla_add_claim(): adding new entry 5c:1d:d9:54:09:8f,
> vid 1 to hash ...
> [  81150100] batadv_bla_send_claim(): CLAIM 5c:1d:d9:54:09:8f on vid 1
> [  81150570] batadv_bla_process_claim(): received a claim frame from
> another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src
> 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
> [  81151360] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
> [  81160810] batadv_bla_process_claim(): received a claim frame from
> another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src
> 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
> [  81161600] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
> [  81171050] batadv_bla_process_claim(): received a claim frame from
> another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src
> 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
> [  81171840] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
> [  81181290] batadv_bla_process_claim(): received a claim frame from
> another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src
> 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
> [  81182080] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
> [  81191530] batadv_bla_process_claim(): received a claim frame from
> another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src
> 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
> [  81192320] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
> [  81201770] batadv_bla_process_claim(): received a claim frame from
> another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src
> 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
> [  81202560] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
> [  81202560] Send loopdetect frame for vid 32769
> [  81202560] batadv_bla_send_claim(): LOOPDETECT of ba:be:bb:25:c7:93 to
> ff:ff:ff:ff:ff:ff on vid 1
> [  81212010] batadv_bla_process_claim(): received a claim frame from
> another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src
> 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
> [  81212800] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
> [  81222250] batadv_bla_process_claim(): received a claim frame from
> another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src
> 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
> [  81223040] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
> [  81232490] batadv_bla_process_claim(): received a claim frame from
> another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src
> 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
> [  81233280] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
> [  81242730] batadv_bla_process_claim(): received a claim frame from
> another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src
> 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
> [  81243520] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
> 
> -------------------------------------------------------------------------
> 
> ...so if I were you I wouldn't hurry to remove
> /sys/kernel/debug/batman_adv/bat0/log.  The trace-cmd stuff doesn't
> appear to work quite correctly yet.


OK, thank you for reporting this!

> 
> -----------------------------------------------------------------------
> 
> Now my first question is: what does it mean to have a
> 
> "LOOPDETECT of ba:be:aa:c0:d6:bf to ff:ff:ff:ff:ff:ff"
> 
> ?

This means that your node has sent a LOOPDETECT packet. Those are sent out on 
the LAN to check if there is a loop somewhere which BLA couldn't resolve or 
where a misconfiguration happened. There are some scenarios which are not 
covered by BLA:

https://www.open-mesh.org/projects/batman-adv/wiki/Bridge-loop-avoidance-II#Limitations
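
To see how often each kind of BLA frame shows up in a log excerpt like the one quoted above, a small script can tally them. This is only a rough sketch in Python, assuming the log has been saved as text and that each entry starts with a bracketed timestamp as in the excerpt. The name `tally_bla_frames` is made up for illustration and is not part of batctl or batman-adv; the frame-type keywords are simply the ones visible in the excerpt.

```python
import re
from collections import Counter

# Matches entries such as:
#   [  81141120] batadv_bla_send_claim(): LOOPDETECT of ba:be:... to
# Only lines produced by batadv_bla_send_claim() are counted, so the
# matching batadv_handle_unclaim() line is not counted a second time.
SEND_CLAIM = re.compile(
    r"\[\s*\d+\]\s+batadv_bla_send_claim\(\):\s+"
    r"(ANNOUNCE|CLAIM|UNCLAIM|LOOPDETECT)\b"
)


def tally_bla_frames(log_text):
    """Count frames sent by batadv_bla_send_claim(), grouped by type.

    search() is used rather than match(), so mail quoting ("> ")
    in front of an entry does not prevent it from being counted.
    """
    counts = Counter()
    for line in log_text.splitlines():
        found = SEND_CLAIM.search(line)
        if found:
            counts[found.group(1)] += 1
    return counts


# Lines copied from the quoted excerpt (the LOOPDETECT entry is wrapped
# over two lines there; the keyword sits on the first one).
sample = """\
[  81110400] batadv_bla_send_claim(): UNCLAIM 5c:1d:d9:54:09:8f on vid 1
[  81110400] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
[  81141120] batadv_bla_send_claim(): LOOPDETECT of ba:be:e5:06:19:6e to
ff:ff:ff:ff:ff:ff on vid 1
[  81150100] batadv_bla_send_claim(): CLAIM 5c:1d:d9:54:09:8f on vid 1
"""

if __name__ == "__main__":
    for frame_type, count in tally_bla_frames(sample).items():
        print(frame_type, count)
```

A steady stream of ANNOUNCE frames with only an occasional LOOPDETECT is what the excerpt itself shows; the tally just makes that pattern easier to spot in a longer capture.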

> 
> -----------------------------------------------------------------------
> 
> When nodes become unreachable, they do so only partially.  Consider this
> weirdness I encountered two days ago:  Given nodes a, b, c, d, from the
> perspective of a, d has disappeared; in other words, "a# batctl ping d"
> doesn't work.  But I ssh'd from a to b, then from b to c, then from c to
> d, all successfully.  And "a# batctl ping d" still wasn't working, even
> though I was talking to d through that chain of ssh pipes.  Any ideas on
> what that might mean?  (When I reboot a -- the gateway -- everything
> always works again, usually for many hours, but never as long as a whole
> day.)

Hmm, that's strange indeed. Did you have a good connection between all those 
devices? There is a certain "horizon": if you have many weak links in a 
daisy chain, the OGMs are dropped before they reach the end of the 
path.

Did you see node D in the originator table of node A?

> 
> Assuming some mischief is being done in bridge-loop-avoidance-land, and
> bearing in mind that both meshes are physically laid out as simple
> linear daisy-chains, is there a way to force each node to ignore all
> nodes other than its immediate lefthand and righthand nodes?  If that's
> possible, it might allow an interesting experiment.

We don't have an option to ignore nodes from a user perspective.

> 
> ----------------------------------------------------------------------
> 
> Do I have a problem because the two meshes, and everything connected to
> them, all share the same LAN?  I note "received a claim frame from
> another group" in the above log excerpt.  (I don't know what that means,
> but I'm guessing that the two meshes are getting each other's
> maintenance traffic.)  Should the two meshes be separate subnets?

It's possible and perfectly fine if you have two meshes connected to the same 
LAN like this:

https://www.open-mesh.org/projects/batman-adv/wiki/Bridge-loop-avoidance-Testcases#Two-meshes-connected-by-one-LAN

Just make sure that the meshes are properly disconnected and do not rejoin from 
time to time (e.g. by having different SSIDs).

> 
> In a similar vein: Should each node be running its own subnet?
> 
> ----------------------------------------------------------------------
> 
> Should I try changing all nodes over to BATMAN_V, rebooting them all,
> and hoping they re-establish contact?  (It would be massively
> inconvenient to have to reset them all physically.)

No, BATMAN V will not magically fix this.

> 
> ----------------------------------------------------------------------
> 
> Should I try turning off bridge loop avoidance?

bridge loop avoidance should be on as soon as you have any two nodes connected 
to the same LAN and mesh at one time.
> 
> ---------------------------------------------------------------------
> 
> Should I try turning on ap_isolation?

I don't see why you need ap_isolation. This is for external stations which are 
connected to APs of the same mesh and shouldn't talk to each other.

> 
> ---------------------------------------------------------------------
> 
> Any other ideas?

I think we should work on your a - b - c - d chain and find out why a can't 
talk to d. That seems like the most obvious symptom.

Cheers,
      Simon


> 
> On 2/7/20 12:07 PM, Simon Wunderlich wrote:
> > On Friday, February 7, 2020 5:59:27 PM CET Steve Newcomb wrote:
> >> On 2/7/20 10:52 AM, Simon Wunderlich wrote:
> >>> Oops, you are right, we have actually removed that command in 2019.2.
> >>> You
> >>> can use one of the two following commands:
> >>> 
> >>> cat /sys/kernel/debug/batman_adv/bat0/log
> >> 
> >> The cat method works.  Thanks.  Wow, with all the debug info selected,
> >> it's a torrent of stuff.  (Which I expected.)
> > 
> > Awesome
> > 
> >>> (will be removed in the future when debugfs support is dropped
> >>> 
> >>>    trace-cmd stream -e batadv:batadv_dbg
> >> 
> >> Uh, *which* of the two above commands will be dropped?  Should I add
> >> trace-cmd to my menuconfig to retain access to the logs?  I haven't
> >> added it yet.
> > 
> > the /sys/kernel/debug/batman_adv/bat0/log file will be removed in the
> > future. So yes, add the trace-cmd
> > 
> >>   > By "works" you mean you get useful outputs where the timeout is not
> >> 
> >> increasing or similar? can you still "batctl ping" to one of your
> >> neighbors?
> >> 
> >> You are far ahead of where I am.  I meant only that I was getting what
> >> appears to be useful information, not that I have interpreted it.  It
> >> will take me a while to figure out how to read it.  I would be grateful
> >> for any hints on that, and I'm already very grateful for the help you
> >> have provided.
> >> 
> >> I haven't been using "batctl ping" to cue a reboot; I've been using
> >> "ping".  I'm not sure what difference it might make to use "batctl
> >> ping"; it seems to me that if I can't "ping" a node, after 3 or 4 trials
> >> over a period of 15-20 seconds, I'm no longer in touch with it in any
> >> usable way.  When that happens, as long as I don't reboot the gateway,
> >> everything just stays offline.  So it's better to reboot.
> >> 
> >> By the way, at least one user is using the mesh for landline telephone
> >> (Google Voice) service, so I'd really like to stabilize it if I can.
> > 
> > So in the iw station dump and the batctl neighbor table, there is a "last
> > seen" or "inactive time" field. Usually those fields should be < 5 seconds
> > or so; if they rise, it means this station has been lost somehow.
> > 
> > The difference between batctl ping and a regular ping is that the batctl
> > ping is actually implemented INSIDE of batman-adv, so it will be used to
> > ping a certain mesh participant. Therefore batctl ping relies on less
> > "moving parts" such as correctly configured IPs, MAC address translation,
> > etc. If batctl ping works but regular ping doesn't then we know something
> > on the Ethernet transport part is off. The other way around (batctl ping
> > doesn't work but regular ping does) is pretty much impossible unless the
> > user does something wrong. If batctl ping to a neighbor doesn't work,
> > it's likely the WiFi layer or something else is exhibiting problems, or
> > batman-adv is not set up correctly.
> > 
> > Cheers,
> > 
> >         Simon




* Re: can't find logged information
  2020-02-19  9:50           ` Simon Wunderlich
@ 2020-02-19 15:00             ` Steve Newcomb
  0 siblings, 0 replies; 9+ messages in thread
From: Steve Newcomb @ 2020-02-19 15:00 UTC (permalink / raw)
  To: Simon Wunderlich; +Cc: b.a.t.m.a.n

I'm very grateful for your very helpful attention to my bizarre problem, 
Simon.

Two days ago, my efforts to instrument the nodes had the weird 
side-effect of making the problem go away.  So now I have no problem; 
only the mystery remains, and I have little hope of resolving it.

For the record, below are some details that seem relevant, at least to me.



On 2/19/20 4:50 AM, Simon Wunderlich wrote:
>
>> -----------------------------------------------------------------------
>>
>> When nodes become unreachable, they do so only partially.  Consider this
>> weirdness I encountered two days ago:  Given nodes a, b, c, d, from the
>> perspective of a, d has disappeared; in other words, "a# batctl ping d"
>> doesn't work.  But I ssh'd from a to b, then from b to c, then from c to
>> d, all successfully.  And "a# batctl ping d" still wasn't working, even
>> though I was talking to d through that chain of ssh pipes.  Any ideas on
>> what that might mean?  (When I reboot a -- the gateway -- everything
>> always works again, usually for many hours, but never as long as a whole
>> day.)
> Hmm, that's strange indeed. Did you have a good connection between all those
> devices? There is a certain "horizon": if you have many weak links in a
> daisy chain, the OGMs are dropped before they reach the end of the
> path.
>
> Did you see node D in the originator table of node A?
As discussed below, when I added instrumentation, the problem 
disappeared.  (*insert muffled scream here*)
>
>> ----------------------------------------------------------------------
>>
>> Do I have a problem because the two meshes, and everything connected to
>> them, all share the same LAN?  I note "received a claim frame from
>> another group" in the above log excerpt.  (I don't know what that means,
>> but I'm guessing that the two meshes are getting each other's
>> maintenance traffic.)  Should the two meshes be separate subnets?
> It's possible and perfectly fine if you have two meshes connected to the same
> LAN like this:
>
> https://www.open-mesh.org/projects/batman-adv/wiki/Bridge-loop-avoidance-Testcases#Two-meshes-connected-by-one-LAN
>
> Just make sure that the meshes are properly disconnected and do not rejoin from
> time to time (e.g. by having different SSIDs).
I think I had missed this page.  Thanks for pointing it out.
>
>> In a similar vein: Should each node be running its own subnet?
>>
>> ----------------------------------------------------------------------
>>
>> Should I try changing all nodes over to BATMAN_V, rebooting them all,
>> and hoping they re-establish contact?  (It would be massively
>> inconvenient to have to reset them all physically.)
> No, BATMAN V will not magically fix this.
Then I won't switch to BATMAN_V.  "If it ain't broke, don't fix it."
>
>> ----------------------------------------------------------------------
>>
>> Should I try turning off bridge loop avoidance?
> bridge loop avoidance should be on as soon as you have any two nodes connected
> to the same LAN and mesh at one time.
Then I guess I don't need BLA.  I'm tempted to turn it off just to avoid 
the overhead, because only the gateways have wired access to the LAN, 
and all other nodes have only their respective meshes.
>
> I think we should work on your a - b - c - d chain and find out why a can't
> talk to d. That seems like the most obvious symptom.
I would do that if it were still broken!

Here's what I did, in some detail: rosepark dot us hash Feb182020





* Re: can't find logged information
  2020-02-07 14:13 Steve Newcomb
@ 2020-02-07 14:51 ` Simon Wunderlich
  0 siblings, 0 replies; 9+ messages in thread
From: Simon Wunderlich @ 2020-02-07 14:51 UTC (permalink / raw)
  To: b.a.t.m.a.n, srn


On Friday, February 7, 2020 3:13:47 PM CET Steve Newcomb wrote:
> @rpc152:/tmp/log# echo "$(logread)" | grep batman
> Thu Feb  6 15:21:13 2020 kern.warn kernel: [174193.938445] batman_adv:
> [Deprecated]: batctl (pid 22747) Use of debugfs file "nc_nodes".
> @rpc152:/tmp/log#
> 
> 
> What have I missed?

Hi Steve,

you can use "batctl log" to retrieve the log. It will not appear in your 
logread.

When the problem happens, you can also check "iw wlan0 station dump" and other 
debug files (batctl n for neighbors) to find out if the WiFi layer is still 
working. It wouldn't be the first time that actually the WiFi chip or driver 
has a problem, not batman-adv.

Cheers,
      Simon



* can't find logged information
@ 2020-02-07 14:13 Steve Newcomb
  2020-02-07 14:51 ` Simon Wunderlich
  0 siblings, 1 reply; 9+ messages in thread
From: Steve Newcomb @ 2020-02-07 14:13 UTC (permalink / raw)
  To: b.a.t.m.a.n

Here in upstate New York, USA, I'm having difficulty with 2 meshes, each 
with 4 nodes, both meshes running BATMAN_IV. All nodes are TP-Link 
Archer C7 or A7 routers running the latest OpenWRT trunk. All nodes are 
stationary.  The radio environment is pretty quiet, I think.  There is 
only one gateway in each mesh.  Nothing ever changes.  (A map of the 
layout can be found at rosepark dot us hash map.)

Nevertheless, each mesh stops working at least once or twice per day.  
If I reboot the gateway node of the one that stops working, the mesh 
starts working again.  In order to keep the meshes running, sort of, 
they now run a script I wrote that reboots them when they stop being 
able to ping each other.  It is not a very satisfactory solution.  If I 
could see what's going on, I might see how to make the meshes more 
stable, but I can't find any debug messages.
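
The reboot decision described above could be sketched roughly as follows. This is not the script actually running on the nodes; it is a minimal Python illustration of "reboot only after several pings in a row have failed". The name `should_reboot` and the default of four consecutive failures are assumptions chosen for the example.

```python
def should_reboot(ping_results, max_failures=4):
    """Return True once the most recent `max_failures` pings all failed.

    ping_results is a list of booleans, oldest first, where True means
    the ping got an answer. A single lost ping never triggers a reboot;
    only an unbroken run of failures at the end of the list does.
    """
    if len(ping_results) < max_failures:
        return False  # not enough samples yet to decide
    return not any(ping_results[-max_failures:])


if __name__ == "__main__":
    history = [True, True, False, False, False, False]
    print(should_reboot(history))
```

Keeping the decision separate from the ping itself makes it easy to swap a regular ping for "batctl ping" later without touching the reboot logic.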

I compiled batctl-full and the kernel module with all options, including 
all debug options.  Here's a portion of a "make menuconfig" screen:

<*> kmod-batman-adv......................................... B.A.T.M.A.N. Adv
[*]   B.A.T.M.A.N. V protocol
[*]   Bridge Loop Avoidance
[*]   Distributed ARP Table
[*]   Network Coding
[*]   Multicast optimisation
[*]   batman-adv debugfs entries
[*]   B.A.T.M.A.N. debugging
[*]   batman-adv sysfs entries
[*]   B.A.T.M.A.N. tracing support


I run "batctl ll all" followed by "batctl ll" and I see:

@rpc152:/tmp/log# batctl ll
[ ] all debug output disabled (none)
[x] messages related to routing / flooding / broadcasting (batman)
[x] messages related to route added / changed / deleted (routes)
[x] messages related to translation table operations (tt)
[x] messages related to bridge loop avoidance (bla)
[x] messages related to arp snooping and distributed arp table (dat)
[x] messages related to network coding (nc)
[x] messages related to multicast (mcast)
[x] messages related to throughput meter (tp)

But the only debug-related log messages I see are:

@rpc152:/tmp/log# echo "$(dmesg)" | grep batman
[   18.672978] batman_adv: B.A.T.M.A.N. advanced 2019.5-openwrt-0 (compatibility version 15) loaded
[   42.067698] batman_adv: bat0: Adding interface: mesh0
[   42.073065] batman_adv: bat0: The MTU of interface mesh0 is too small (1500) to handle the transport of batman-adv packets. Packets going over this interface will be fragmented on layer2 which could impact the performance. Setting the MTU to 1560 would solve the problem.
[   42.098069] batman_adv: bat0: Interface activated: mesh0
[174193.938445] batman_adv: [Deprecated]: batctl (pid 22747) Use of debugfs file "nc_nodes".


@rpc152:/tmp/log# echo "$(logread)" | grep batman
Thu Feb  6 15:21:13 2020 kern.warn kernel: [174193.938445] batman_adv: [Deprecated]: batctl (pid 22747) Use of debugfs file "nc_nodes".
@rpc152:/tmp/log#


What have I missed?

Thanks.

Steve Newcomb


end of thread, other threads:[~2020-02-19 15:00 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <24aa48bb-8f52-78be-28df-98ec2e2a054b@coolheads.com>
2020-02-07 15:18 ` can't find logged information Steve Newcomb
2020-02-07 15:52   ` Simon Wunderlich
2020-02-07 16:59     ` Steve Newcomb
2020-02-07 17:07       ` Simon Wunderlich
2020-02-11 19:03         ` Steve Newcomb
2020-02-19  9:50           ` Simon Wunderlich
2020-02-19 15:00             ` Steve Newcomb
2020-02-07 14:13 Steve Newcomb
2020-02-07 14:51 ` Simon Wunderlich
