* [B.A.T.M.A.N.] mesh losing internal Ilayer3 connectivity
@ 2016-05-02 11:57 Karl Auer
  2016-05-02 14:54 ` Sven Eckelmann
  0 siblings, 1 reply; 5+ messages in thread
From: Karl Auer @ 2016-05-02 11:57 UTC
  To: BATMAN List

My apologies up front for a newbie question in this apparently very
technical list. If there is a more appropriate list or forum please
direct me to it.

I'm running batman-adv (Chaos Calmer, r47065) on OpenWRT on the
GL-AR150 platform. It all works swimmingly, except that sometimes, for no
apparent reason, layer 3 connectivity across the mesh (inside it) is
lost. All nodes are still up and running and I can log into them via
their APs or their LAN interfaces. batctl on any node still sees all
the other nodes, and batctl can ping them by MAC address. Each node
still has its layer 3 address and can ping it locally. The ARP data for
the other nodes is correct (at least in my two-node test system). I
don't seem to be able to delete ARP entries.

But no node can ping any other node by IP address. Restarting
networking on any or even all nodes doesn't help - all that helps is
rebooting everything.

As long as my mesh is six nodes on a table, rebooting everything is an
option. Once deployed - not so much :-)

It's possible to engineer this failure in a two-node mesh by just
restarting networking a few times quickly.

So I'm wondering a) if this is a known issue, b) if this is an obvious
symptom of some stuff-up on my part, c) if there's a quick fix :-) and
d) failing all of the above, whether there is some way out of the
situation without having to reboot every time.

The mesh has one node configured as a gateway, connected to an
upstream, and running DHCP on the bat interface. The other nodes are
configured as gateway clients and have nothing connected on their LAN
or WAN ports. All nodes have a static address on the bat0 interface,
all have a static route to the gateway's (inside) IP address. All have
an AP on the same radio as the mesh, with an RFC1918 network on it
masqueraded into the mesh.

And mostly, it works :-)

Yours hopefully, K.

-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Karl Auer (kauer@biplane.com.au)
http://www.biplane.com.au/kauer
http://twitter.com/kauer389

GPG fingerprint: E00D 64ED 9C6A 8605 21E0 0ED0 EE64 2BEE CBCB C38B
Old fingerprint: 3C41 82BE A9E7 99A1 B931 5AE7 7638 0147 2C3C 2AC4





* Re: [B.A.T.M.A.N.] mesh losing internal Ilayer3 connectivity
  2016-05-02 11:57 [B.A.T.M.A.N.] mesh losing internal Ilayer3 connectivity Karl Auer
@ 2016-05-02 14:54 ` Sven Eckelmann
  2016-05-19 14:22   ` Nick Schaf
  0 siblings, 1 reply; 5+ messages in thread
From: Sven Eckelmann @ 2016-05-02 14:54 UTC
  To: b.a.t.m.a.n

On Monday 02 May 2016 21:57:49 Karl Auer wrote:
> My apologies up front for a newbie question in this apparently very
> technical list. If there is a more appropriate list or forum please
> direct me to it.
> 
> I'm running batman-adv (Chaos Calmer, r47065) on OpenWRT on the
> GL-AR150 platform.

Are you using v2016.1 or some older version of batman-adv, for example
something like v2014.4?

What kind of layer 3 are you using? IPv4/IPv6/...? What is your current
configuration (for example, have you enabled DAT, BLA, ...)? Did you check
what exactly goes over the air and what the device (the adhoc one)
receives/sends? What data is sent/received over the batman-adv
devices?

Did you hardcode the MAC address of the batman-adv device, or do you let it
change to a random value on each device creation? Is the device part of a
bridge, or is the IP configured directly on the batman-adv device?

Are you sure that the conntrack for the masquerade over the mesh isn't broken?
Why are you masquerading over the mesh anyway?
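
For anyone wanting to answer the questions above in one go, here is a rough
Python sketch that gathers the relevant state into a single report. Treat it
as a starting point only: the command list is my assumption about what is
useful (batctl -v, batctl dat, batctl bl, batctl if, ip, brctl), the interface
name bat0 is taken from this thread, and a small OpenWrt node may well not
have Python installed, in which case the same commands can simply be run by
hand. The helper names (CHECKS, run, collect, format_report) are made up for
this example.

```python
import shutil
import subprocess

# Assumed set of checks; adjust the interface name (bat0) to the local setup.
CHECKS = {
    "batctl / batman-adv version": ["batctl", "-v"],
    "DAT setting": ["batctl", "dat"],
    "BLA setting": ["batctl", "bl"],
    "mesh interfaces": ["batctl", "if"],
    "bat0 MAC and IP addresses": ["ip", "addr", "show", "dev", "bat0"],
    "bridge membership": ["brctl", "show"],
}


def run(cmd):
    """Run one command and return its output, or a note if it cannot run."""
    if shutil.which(cmd[0]) is None:
        return f"(skipped: {cmd[0]} not found)"
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"(failed: {exc})"
    return (done.stdout + done.stderr).strip()


def collect(checks=CHECKS, runner=run):
    """Return {label: output} for every check, in the order given."""
    return {label: runner(cmd) for label, cmd in checks.items()}


def format_report(results):
    """Render the collected outputs as plain text suitable for a mail."""
    return "\n".join(f"== {label} ==\n{out}\n" for label, out in results.items())
```

Calling print(format_report(collect())) on a node prints one section per
check; tools that are missing are reported as skipped rather than aborting
the run. Packet captures and conntrack state are deliberately left out, since
they need to be taken while the problem is actually happening.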

Kind regards,
	Sven


* Re: [B.A.T.M.A.N.] mesh losing internal Ilayer3 connectivity
  2016-05-02 14:54 ` Sven Eckelmann
@ 2016-05-19 14:22   ` Nick Schaf
  2016-05-20  7:23     ` Sven Eckelmann
  0 siblings, 1 reply; 5+ messages in thread
From: Nick Schaf @ 2016-05-19 14:22 UTC
  To: The list for a Better Approach To Mobile Ad-hoc Networking

I've run into what sounds like a similar problem, but dove in and found more details.  Here's the setup:

- 19 nodes running BATMAN-adv 2014.14.0 on OpenWrt Chaos Calmer; various hardware (D-Link, BBB, Open-Mesh, WRTnode, TP-Link).
- bat0 using ad-hoc interface on each node
- bat0 bridged (br-lan) with Ethernet
- br-lan on all nodes (except node 1) gets a DHCPv4 address from dnsmasq running on node 1
- A few PCs are hard-wired to LAN on node 1, one PC wired to LAN at node 11, all other nodes completely standalone
- WAN on node 1 connected to building network - only connection to any outside network
- DAT enabled
- BLA disabled

Problem:
After several weeks uptime, node 1 could no longer SSH or ping (L3) node 3.  Tcpdumps showed ping rec'd at node 3 and node 3 replied, but reply never arrived at node 1.  Linux PC wired to LAN at node 1 successfully pings node 3. L2 ping (via batctl) works between nodes 1 and 3.
Further investigation showed two entries for node 1's br-lan MAC in the global translation table at node 3.  Secondary entry was correct; primary entry pointed to node 4.  Node 4's tables (local and global) were both correct.

root@WifiMesh-03:~# batctl tg | grep c8:d3:a3:70:a9:b0
* 42:5e:78:f3:50:7e    0   (  2) via c8:d3:a3:70:a9:b0     ( 25)   (0xd7886ba8) [....]
* c8:d3:a3:70:a9:b0   -1   ( 19) via c8:d3:a3:70:a9:53     ( 19)   (0x10e4856e) [....]
 + c8:d3:a3:70:a9:b0   -1   (  2) via c8:d3:a3:70:a9:b0     ( 25)   (0x352c5b78) [....]
 * 42:5e:78:f3:50:7e   -1   (  2) via c8:d3:a3:70:a9:b0     ( 25)   (0x352c5b78) [....] 

(Yes, br-lan and adhoc0 have same MAC on node 1.  Yes, these are D-Link routers.)
...50:7e is bat0 at node 1, ...a9:b0 is adhoc0/br-lan at node 1, ...a9:53 is adhoc0/br-lan at node 4

This part may be odd: problem persisted for a few days while I investigated, but resolved immediately after viewing the tables on node 4.  May be coincidence, though, because it didn't work for the following nodes.

At same time, same problem existed with two other nodes on the mesh: node 13 (an OM2P-HS) matched node 3's global table; node 9 (a WRTnode) showed a primary entry for node 1's br-lan using yet another originator.  Rebooted nodes to resolve.
Problem happened again more recently, but the destination MAC was that of the Linux PC mentioned above, attached to the LAN on node 1.  In this case, most nodes' global tables showed two entries for that MAC, though the originator in the primary entry was not consistent.
 
I plan to test with a more recent version of BATMAN-adv once I standardize on one model of hardware for the nodes (should be within the next few months).  In the meantime, I plan to watch for secondary entries in the global translation tables, since the current configuration should never result in one client being accessible through multiple nodes.
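
Watching for such secondary entries could be automated along these lines. This
is only a sketch: it assumes the "batctl tg" column layout shown in the output
above (marker, client MAC, VID, ttvn, "via", originator MAC), which may differ
between batctl versions, and the function name find_multi_originator_clients
is made up for this example. Entries are grouped by client MAC and VID, so the
same client seen on two VLANs through one originator is not flagged.

```python
import re
from collections import defaultdict

MAC = r"(?:[0-9a-f]{2}:){5}[0-9a-f]{2}"

# Matches: marker (* or +), client MAC, VID, (ttvn), "via", originator MAC.
# Any columns after the originator are ignored.
ENTRY = re.compile(
    rf"^\s*([*+])\s+({MAC})\s+(-?\d+)\s+\(\s*\d+\)\s+via\s+({MAC})",
    re.IGNORECASE,
)


def find_multi_originator_clients(tg_output):
    """Return {(client, vid): [(marker, originator), ...]} for every client
    that the global translation table lists behind more than one originator."""
    seen = defaultdict(list)
    for line in tg_output.splitlines():
        match = ENTRY.match(line)
        if match is None:
            continue  # header lines or anything in an unexpected format
        marker, client, vid, originator = match.groups()
        seen[(client.lower(), int(vid))].append((marker, originator.lower()))
    return {
        key: entries
        for key, entries in seen.items()
        if len({originator for _, originator in entries}) > 1
    }
```

Fed the four lines of output above, this reports only c8:d3:a3:70:a9:b0
(VID -1), listed via both ...a9:53 and ...a9:b0. On a node, the input could be
the captured text of "batctl tg" run periodically, with any non-empty result
logged for later inspection.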

Thanks,
-Nick

 

* Re: [B.A.T.M.A.N.] mesh losing internal Ilayer3 connectivity
  2016-05-19 14:22   ` Nick Schaf
@ 2016-05-20  7:23     ` Sven Eckelmann
  2016-05-20  7:33       ` Antonio Quartulli
  0 siblings, 1 reply; 5+ messages in thread
From: Sven Eckelmann @ 2016-05-20  7:23 UTC
  To: b.a.t.m.a.n

On Thursday 19 May 2016 14:22:58 Nick Schaf wrote:
> I've run into what sounds like a similar problem, but dove in and found more
> details.  Here's the setup:
> 
> -19 nodes running BATMAN-adv 2014.14.0 on OpenWrt Chaos Calmer; various
> hardware (D-Link, BBB, Open-Mesh, WRTnode, TP-Link).

Only scrolled through your mail, but there are two things which I find odd.
First, you use a really old (actually nonexistent) version of batman-adv.

Then you have some TT problems. I think we have had many fixes since then which
may be related to your problem. But going through 2 years of fixes might be
quite a hard (or at least very cumbersome) journey.

Maybe it is really a good idea to try to upgrade to the recent version
(2016.1+fixes) from the Chaos Calmer openwrt-routing feed on all your nodes.
Maybe Antonio remembers one special TT/roaming bug and can recommend one to
test. But most likely testing the current version is easier.

But thanks for gathering all the info.

Kind regards,
	Sven


* Re: [B.A.T.M.A.N.] mesh losing internal Ilayer3 connectivity
  2016-05-20  7:23     ` Sven Eckelmann
@ 2016-05-20  7:33       ` Antonio Quartulli
  0 siblings, 0 replies; 5+ messages in thread
From: Antonio Quartulli @ 2016-05-20  7:33 UTC
  To: The list for a Better Approach To Mobile Ad-hoc Networking

On Fri, May 20, 2016 at 09:23:24AM +0200, Sven Eckelmann wrote:
> On Thursday 19 May 2016 14:22:58 Nick Schaf wrote:
> > I've run into what sounds like a similar problem, but dove in and found more
> > details.  Here's the setup:
> > 
> > -19 nodes running BATMAN-adv 2014.14.0 on OpenWrt Chaos Calmer; various
> > hardware (D-Link, BBB, Open-Mesh, WRTnode, TP-Link).
> 
> Only scrolled through your mail. But there are two things which I find odd.
> First you use a really old (actually not existing) version of batman-adv.
> 
> Then you have some TT problems. I think we had many fixes since then which may
> be related to your problem. But going through 2 years of fixes might be a 
> quite hard (at very cumbersome) journey. 
> 
> Maybe it is really a good idea to try to upgrade to the recent version 
> (2016.1+fixes) from the Chaos Calmers openwrt-routing feed on all your nodes. 
> Maybe Antonio remembers one special TT/roaming bug and can recommend one to 
> test. But most likely testing the current version is easier.

I don't recall any superfix which might magically solve the problems you
are seeing. Therefore I'd just follow Sven's suggestion and try running a recent
version of batman-adv.

Cheers,


-- 
Antonio Quartulli

