b.a.t.m.a.n.lists.open-mesh.org archive mirror
* [B.A.T.M.A.N.] Kernel crashes with batgat installed
@ 2009-05-19 14:27 Nathan Wharton
  2009-05-19 19:21 ` Sven Eckelmann
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Nathan Wharton @ 2009-05-19 14:27 UTC (permalink / raw)
  To: The list for a Better Approach To Mobile Ad-hoc Networking

I am using batman 1256 on a very recent openwrt (linux version
2.6.28.10) as well as a bit older one (linux version 2.6.26.8).

With batgat installed, I have problems with the kernel crashing when
turning the gateway on and off.  I start batman with -r 2.  If I
detect an uplink, I issue -c -g 11000.  If I lose the link, I issue -c
-r 2.  It is this final -c -r 2 that causes the kernel to either crash
with a bad page on the next process that is created, have a null
pointer error, or have a recursion error.

If I run batman without batgat, I don't get any crashes.

Everything works fine otherwise.  Except one thing that just came to
mind, I had to remove -DDEBUG_MALLOC -DMEMORY_USAGE because batman
wouldn't do anything without crashing because of magic number
problems.  Could this be because I am on Big Endian hardware?

Could anyone else see if they have the same problem?  All you have to
do is have batman running with batgat installed, start issuing batmand
-c -g 11000 ; batmand -c -r 2 multiple times and see if their system
stays stable.

Thanks.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [B.A.T.M.A.N.] Kernel crashes with batgat installed
  2009-05-19 14:27 [B.A.T.M.A.N.] Kernel crashes with batgat installed Nathan Wharton
@ 2009-05-19 19:21 ` Sven Eckelmann
  2009-05-19 20:38   ` Nathan Wharton
  2009-05-28 10:40 ` [B.A.T.M.A.N.] [PATCH] [batman] Add padding around allocation debugger structures Sven Eckelmann
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 24+ messages in thread
From: Sven Eckelmann @ 2009-05-19 19:21 UTC (permalink / raw)
  To: b.a.t.m.a.n


Hi,
thanks for your report. I am currently running some stress tests on x86 and 
mips and couldn't reproduce any such problems. So I have some questions 
regarding your configuration.

On Tuesday 19 May 2009 16:27:25 Nathan Wharton wrote:
> I am using batman 1256 on a very recent openwrt (linux version
> 2.6.28.10) as well as a bit older one (linux version 2.6.26.8).
What is your target architecture in openwrt?  Have you tried to reproduce that 
problem on another architecture?

> With batgat installed, I have problems with the kernel crashing when
> turning the gateway on and off.  I start batman with -r 2.  If I
> detect an uplink, I issue -c -g 11000.  If I lose the link, I issue -c
> -r 2.  It is this final -c -r 2 that causes the kernel to either crash
> with a bad page on the next process that is created, have a null
> pointer error, or have a recursion error.
Can you create a readable kernel backtrace with ksymoops?

> If I run batman without batgat, I don't get any crashes.
>
> Everything works fine otherwise.  Except one thing that just came to
> mind, I had to remove -DDEBUG_MALLOC -DMEMORY_USAGE because batman
> wouldn't do anything without crashing because of magic number
> problems.  Could this be because I am on Big Endian hardware?
I am also running it on big endian hardware and it seems to work. Does it 
happen right after the start or was extra interaction needed? What was the 
error output?

> Could anyone else see if they have the same problem?  All you have to
> do is have batman running with batgat installed, start issuing batmand
> -c -g 11000 ; batmand -c -r 2 multiple times and see if their system
> stays stable.
I have been running it in a while true loop for an hour on x86 and mips, on
isolated and non-isolated (single partner) nodes, and didn't get such problems.

Regards,
	Sven



* Re: [B.A.T.M.A.N.] Kernel crashes with batgat installed
  2009-05-19 19:21 ` Sven Eckelmann
@ 2009-05-19 20:38   ` Nathan Wharton
  2009-05-20  1:30     ` Marek Lindner
  0 siblings, 1 reply; 24+ messages in thread
From: Nathan Wharton @ 2009-05-19 20:38 UTC (permalink / raw)
  To: Sven Eckelmann; +Cc: b.a.t.m.a.n

Thanks for your reply.  I will answer inline below:

On Tue, May 19, 2009 at 2:21 PM, Sven Eckelmann <sven.eckelmann@gmx.de> wrote:
> Hi,
> thanks for your report. I am currently running some stress tests on x86 and
> mips and couldn't reproduce any such problems. So I have some questions
> regarding your configuration.
>
> On Tuesday 19 May 2009 16:27:25 Nathan Wharton wrote:
>> I am using batman 1256 on a very recent openwrt (linux version
>> 2.6.28.10) as well as a bit older one (linux version 2.6.26.8).
> What is your target architecture in openwrt?  Have you tried to reproduce that
> problem on another architecture?

The target is a Gateworks Avila 2348-4 board, which has an IXP425.
I haven't tried another target yet.

>> With batgat installed, I have problems with the kernel crashing when
>> turning the gateway on and off.  I start batman with -r 2.  If I
>> detect an uplink, I issue -c -g 11000.  If I lose the link, I issue -c
>> -r 2.  It is this final -c -r 2 that causes the kernel to either crash
>> with a bad page on the next process that is created, have a null
>> pointer error, or have a recursion error.
> Can you create a readable kernel backtrace with ksymoops?

I can, but it is never in the batman process, which is why I didn't
think it was batman until I figured out how to reproduce it.  For
example:
=====================================
root@SchaferRobotics_1_3:/# batmand -c -g 11000
WARNING: You are using the unstable batman branch. If you are
interested in *using* batman get the latest stable release !
root@SchaferRobotics_1_3:/# batmand -c
WARNING: You are using the unstable batman branch. If you are
interested in *using* batman get the latest stable release !
batmand -g 12MBit/1536KBit -a 10.1.3.0/24 -a 10.255.1.3/32 -d 3
--hop-penalty 5 --purge-timeout 10000
ath0 eth0
root@SchaferRobotics_1_3:/# batmand -c -r 2
WARNING: You are using the unstable batman branch. If you are
interested in *using* batman get the latest stable release !
Bad page state in process 'volts_temp'
page:c0335440 flags:0x00000000 mapping:00000000 mapcount:0 count:-1
Trying to fix it up, but a reboot is needed
Backtrace:
[<c0028680>] (dump_stack+0x0/0x14) from [<c0064a08>] (bad_page+0x74/0xb4)
[<c0064994>] (bad_page+0x0/0xb4) from [<c0065a0c>]
(get_page_from_freelist+0x45c/0x4a0)
 r6:c02bd7e8 r5:c02be02c r4:c0335440
[<c00655b0>] (get_page_from_freelist+0x0/0x4a0) from [<c0065afc>]
(__alloc_pages_internal+0xac/0x3e0)
[<c0065a50>] (__alloc_pages_internal+0x0/0x3e0) from [<c0065e50>]
(__get_free_pages+0x20/0x54)
[<c0065e30>] (__get_free_pages+0x0/0x54) from [<c0033af4>]
(copy_process+0x90/0xd40)
[<c0033a64>] (copy_process+0x0/0xd40) from [<c0034924>] (do_fork+0x70/0x2a4)
[<c00348b4>] (do_fork+0x0/0x2a4) from [<c0027c00>] (sys_fork+0x30/0x38)
[<c0027bd0>] (sys_fork+0x0/0x38) from [<c0024de0>] (ret_fast_syscall+0x0/0x2c)
=====================================
volts_temp, in this case, happens to be the next process that tried to
run.  I get a similar trace even if it is another process.

>> If I run batman without batgat, I don't get any crashes.
>>
>> Everything works fine otherwise.  Except one thing that just came to
>> mind, I had to remove -DDEBUG_MALLOC -DMEMORY_USAGE because batman
>> wouldn't do anything without crashing because of magic number
>> problems.  Could this be because I am on Big Endian hardware?
> I am running it also on big endian hardware and it seems to work. Does it
> happen right after the start or were extra interaction needed? What was the
> error output?

It happens right after the start, and the error is debugRealloc -
invalid magic number in trailer.

>> Could anyone else see if they have the same problem?  All you have to
>> do is have batman running with batgat installed, start issuing batmand
>> -c -g 11000 ; batmand -c -r 2 multiple times and see if their system
>> stays stable.
> I am running it in a while true loop since an hour on x86 and mips on isolated
> and non isolated (single partner) nodes and didn't get such problems.

Here is a little more on our setup:

All boards run the same software.  Each board has 2 mesh interfaces.
One is a radio, one is wired.  So, batman runs on 2 interfaces on
every board.
Each board has a downstream wired interface with a dhcp server.
batman announces this network.
This downstream network is different for every board due to a
group/node numbering scheme.  The network is 10.group.node.0/24.
Group and Node are 1-250.  The wireless interface is 10.0.group.node,
and the wired interface is 10.255.group.node.
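
As an illustration of the numbering scheme above, here is a minimal C
sketch. The helper board_addrs_fill and its struct are hypothetical names
and not part of batman or of the setup described in this message; only the
three address patterns are taken from the text.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of batman: derives the three addresses
 * of one board from its group and node number (both 1-250). */
struct board_addrs {
	char announced[32];	/* downstream network, 10.group.node.0/24 */
	char wireless[32];	/* mesh radio,         10.0.group.node    */
	char wired[32];		/* wired mesh link,    10.255.group.node  */
};

/* Returns 0 on success, -1 if group or node is out of range. */
static int board_addrs_fill(struct board_addrs *out, int group, int node)
{
	if (group < 1 || group > 250 || node < 1 || node > 250)
		return -1;

	snprintf(out->announced, sizeof(out->announced),
		 "10.%d.%d.0/24", group, node);
	snprintf(out->wireless, sizeof(out->wireless),
		 "10.0.%d.%d", group, node);
	snprintf(out->wired, sizeof(out->wired),
		 "10.255.%d.%d", group, node);
	return 0;
}
```

For group 1, node 3 this yields 10.1.3.0/24, 10.0.1.3 and 10.255.1.3.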

A board can have an optional second radio, and if it does, it is used
to try to find an open wireless access point.
A board can also have an optional cellular modem and will try to use
it if it does.

If a default route gets set by one of these options, batmand -c -g is
used.  If the default route goes away, -c -r is used.
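
The switching rule above could be sketched as follows, assuming the same
two commands quoted earlier in this thread (-c -g 11000 and -c -r 2). The
function name and the edge-triggered behaviour (a command is only issued
when the uplink state changes) are assumptions for illustration, not a
description of the actual scripts running on the boards.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the batmand client command to issue when
 * the default-route state changes, or NULL if nothing changed. */
static const char *uplink_transition_cmd(int had_default_route,
					 int has_default_route)
{
	if (!had_default_route && has_default_route)
		return "batmand -c -g 11000";	/* uplink appeared: act as gateway */
	if (had_default_route && !has_default_route)
		return "batmand -c -r 2";	/* uplink lost: back to routing class 2 */
	return NULL;				/* no transition: leave batmand alone */
}
```

The second branch corresponds to the gateway-to-routing-class switch that
is reported to trigger the crash.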

The boards are then either used as a mesh network extender, to provide
access to the mesh to a computer, or attached to a mobile platform
which can be controlled from any computer with access to the mesh.

The --hop-penalty of 5 was tested to be the best value for a mobile
platform just on the edge of needing to hop.

The --purge-timeout of 10000 is so that any boards that have been
turned off don't hang around long.

The current setup I am testing is 3 boards.  1 in the middle has a
wireless connection to one and a wired connection to the other.
The node in the middle has the optional wireless uplink.
The node connected via wired has the optional cellular uplink.

I appreciate you trying it out.  I'll try looking a bit deeper.


* Re: [B.A.T.M.A.N.] Kernel crashes with batgat installed
  2009-05-19 20:38   ` Nathan Wharton
@ 2009-05-20  1:30     ` Marek Lindner
  2009-05-20 14:34       ` Nathan Wharton
  0 siblings, 1 reply; 24+ messages in thread
From: Marek Lindner @ 2009-05-20  1:30 UTC (permalink / raw)
  To: The list for a Better Approach To Mobile Ad-hoc Networking

On Wednesday 20 May 2009 04:38:31 Nathan Wharton wrote:
> >> Everything works fine otherwise.  Except one thing that just came to
> >> mind, I had to remove -DDEBUG_MALLOC -DMEMORY_USAGE because batman
> >> wouldn't do anything without crashing because of magic number
> >> problems.  Could this be because I am on Big Endian hardware?
> >
> > I am running it also on big endian hardware and it seems to work. Does it
> > happen right after the start or were extra interaction needed? What was
> > the error output?
>
> It happens right after the start, and the error is debugRealloc -
> invalid magic number in trailer.

The DEBUG_MALLOC option enables additional functions within batman that make 
it easier to trace malloc bugs. A simple core dump might help, but sometimes 
it is hard to say where the corruption is coming from. Could you post the 
exact "invalid magic number in trailer" line ?

I'm a bit confused here:
* Is batman crashing ?
* Is the kernel crashing ?
* Is batman crashing if you use the batgat module ?

Regards,
Marek



* Re: [B.A.T.M.A.N.] Kernel crashes with batgat installed
  2009-05-20  1:30     ` Marek Lindner
@ 2009-05-20 14:34       ` Nathan Wharton
  2009-05-20 16:10         ` Marek Lindner
  0 siblings, 1 reply; 24+ messages in thread
From: Nathan Wharton @ 2009-05-20 14:34 UTC (permalink / raw)
  To: The list for a Better Approach To Mobile Ad-hoc Networking

On Tue, May 19, 2009 at 8:30 PM, Marek Lindner <lindner_marek@yahoo.de> wrote:
> On Wednesday 20 May 2009 04:38:31 Nathan Wharton wrote:
>> >> Everything works fine otherwise.  Except one thing that just came to
>> >> mind, I had to remove -DDEBUG_MALLOC -DMEMORY_USAGE because batman
>> >> wouldn't do anything without crashing because of magic number
>> >> problems.  Could this be because I am on Big Endian hardware?
>> >
>> > I am running it also on big endian hardware and it seems to work. Does it
>> > happen right after the start or were extra interaction needed? What was
>> > the error output?
>>
>> It happens right after the start, and the error is debugRealloc -
>> invalid magic number in trailer.
>
> The DEBUG_MALLOC option enables additional functions within batman that allow
> it to easily trace back malloc bugs. A simple core dump might help but
> sometimes it is hard to say where it is coming from. Could you post the exact
> "invalid trailer number" line ?

Here is the output when I have DEBUG_MALLOC on from start to finish:
=========================================
root@SchaferRobotics_1_3:/# batmand -d 3 -r 2 -a 10.1.3.0/24
--disable-client-nat --hop-penalty 5 --purge-timeout 10000 ath0 eth0
WARNING: You are using the unstable batman branch. If you are
interested in *using* batman get the latest stable release !
Deleting throw route to 10.1.3.0/24 via 0.0.0.0 (table 66 - unknown)
Deleting throw route to 10.1.3.0/24 via 0.0.0.0 (table 66 - unknown)
Deleting throw route to 10.1.3.0/24 via 0.0.0.0 (table 65 - unknown)
Deleting throw route to 10.1.3.0/24 via 0.0.0.0 (table 65 - unknown)
Deleting throw route to 10.1.3.0/24 via 0.0.0.0 (table 67 - unknown)
Deleting throw route to 10.1.3.0/24 via 0.0.0.0 (table 67 - unknown)
Interface activated: ath0
Using interface ath0 with address 10.0.1.3 and broadcast address 10.0.255.255
Interface activated: eth0
Using interface eth0 with address 10.255.1.3 and broadcast address
10.255.255.255
B.A.T.M.A.N. 0.3.2-beta rv1256 (compatibility version 5)
Adding throw route to 127.0.0.0/8 via 0.0.0.0 (table 68 - lo)
Adding throw route to 10.1.3.0/24 via 0.0.0.0 (table 68 - eth1)
Adding throw route to 10.0.0.0/16 via 0.0.0.0 (table 68 - ath0)
debug level: 3
routing class: 2
Adding throw route to 10.1.3.0/24 via 0.0.0.0 (table 65 - unknown)
Adding throw route to 10.1.3.0/24 via 0.0.0.0 (table 66 - unknown)
Adding throw route to 10.1.3.0/24 via 0.0.0.0 (table 67 - unknown)
Adding throw route to 10.1.3.0/24 via 0.0.0.0 (table 68 - unknown)
Error - can't add throw route to 10.1.3.0/24 via 0.0.0.0 (table 68): File exists
debugRealloc - invalid magic number in trailer: 78183456, malloc tag = 15
Deleting throw route to 127.0.0.0/8 via 0.0.0.0 (table 68 - lo)
Deleting throw route to 10.1.3.0/24 via 0.0.0.0 (table 68 - eth1)
Deleting throw route to 10.0.0.0/16 via 0.0.0.0 (table 68 - ath0)
Interface deactivated: ath0
Interface deactivated: eth0
=========================================

> I'm a bit confused here:
> * Is batman crashing ?
> * Is the kernel crashing ?
> * Is batman crashing if you use the batgat module ?

This is only happening when using the batgat module.
The kernel is crashing.  If it happens to not reboot, I see that
batman is in a device wait state and can't be killed.

I see that some more patches were added recently.  I will try them and
see if anything changes.


* Re: [B.A.T.M.A.N.] Kernel crashes with batgat installed
  2009-05-20 14:34       ` Nathan Wharton
@ 2009-05-20 16:10         ` Marek Lindner
  2009-05-20 17:01           ` Nathan Wharton
  0 siblings, 1 reply; 24+ messages in thread
From: Marek Lindner @ 2009-05-20 16:10 UTC (permalink / raw)
  To: The list for a Better Approach To Mobile Ad-hoc Networking

On Wednesday 20 May 2009 22:34:29 Nathan Wharton wrote:
> Here is the output when I have DEBUG_MALLOC on from start to finish:
> =========================================
> root@SchaferRobotics_1_3:/# batmand -d 3 -r 2 -a 10.1.3.0/24
> --disable-client-nat --hop-penalty 5 --purge-timeout 10000 ath0 eth0
> WARNING: You are using the unstable batman branch. If you are
> interested in *using* batman get the latest stable release !
> B.A.T.M.A.N. 0.3.2-beta rv1256 (compatibility version 5)

Thanks, that helps.


> Adding throw route to 127.0.0.0/8 via 0.0.0.0 (table 68 - lo)
> Adding throw route to 10.1.3.0/24 via 0.0.0.0 (table 68 - eth1)
> Adding throw route to 10.0.0.0/16 via 0.0.0.0 (table 68 - ath0)
> debug level: 3
> routing class: 2
> Adding throw route to 10.1.3.0/24 via 0.0.0.0 (table 65 - unknown)
> Adding throw route to 10.1.3.0/24 via 0.0.0.0 (table 66 - unknown)
> Adding throw route to 10.1.3.0/24 via 0.0.0.0 (table 67 - unknown)
> Adding throw route to 10.1.3.0/24 via 0.0.0.0 (table 68 - unknown)
> Error - can't add throw route to 10.1.3.0/24 via 0.0.0.0 (table 68): File exists
> debugRealloc - invalid magic number in trailer: 78183456, malloc tag = 15
> Deleting throw route to 127.0.0.0/8 via 0.0.0.0 (table 68 - lo)
> Deleting throw route to 10.1.3.0/24 via 0.0.0.0 (table 68 - eth1)
> Deleting throw route to 10.0.0.0/16 via 0.0.0.0 (table 68 - ath0)

Ok, from your output plus the invalid magic number we can say that it seems 
somewhat related to your HNA settings. A few more questions:
* In this case the batgat module is not involved and still it crashes ?!
* Is your network up & running ? Does batman receive messages from neighbor 
nodes (you can track that via debug log 4) ?
* Does batman also crash in a disconnected environment ?


> This is only happening when using the batgat module.
> The kernel is crashing.  If it happens to not reboot, I see that
> batman is in a device wait state and can't be killed.

The log you just provided is not about a kernel crash - it's "just" the batman 
daemon. Are we hunting 2 different bugs ?


> I see that some more patches were added recently.  I will try them and
> see if anything changes.

Ok, keep us posted.

Regards,
Marek




* Re: [B.A.T.M.A.N.] Kernel crashes with batgat installed
  2009-05-20 16:10         ` Marek Lindner
@ 2009-05-20 17:01           ` Nathan Wharton
  2009-05-20 19:02             ` Marek Lindner
  0 siblings, 1 reply; 24+ messages in thread
From: Nathan Wharton @ 2009-05-20 17:01 UTC (permalink / raw)
  To: The list for a Better Approach To Mobile Ad-hoc Networking

On Wed, May 20, 2009 at 11:10 AM, Marek Lindner <lindner_marek@yahoo.de> wrote:
> Ok, from your output plus the invalid magic number we can say that it seems
> somewhat related to your HNA settings. A few more questions:
> * In this case the batgat module is not involved and still it crashes ?!
> * Is your network up & running ? Does batman receive messages from neighbor
> nodes (you can track that via debug log 4) ?
> * Does batman also crash in a disconnected environment ?

In this case, it does the same thing whether or not batgat is installed.

Debug level 4 gives:
========================================
WARNING: You are using the unstable batman branch. If you are
interested in *using* batman get the latest stable release !
Deleting throw route to 10.1.3.0/24 via 0.0.0.0 (table 66 - unknown)
Deleting throw route to 10.1.3.0/24 via 0.0.0.0 (table 66 - unknown)
Deleting throw route to 10.1.3.0/24 via 0.0.0.0 (table 65 - unknown)
Deleting throw route to 10.1.3.0/24 via 0.0.0.0 (table 65 - unknown)
Deleting throw route to 10.1.3.0/24 via 0.0.0.0 (table 67 - unknown)
Deleting throw route to 10.1.3.0/24 via 0.0.0.0 (table 67 - unknown)
Interface activated: ath0
Using interface ath0 with address 10.0.1.3 and broadcast address 10.0.255.255
Interface activated: eth0
Using interface eth0 with address 10.255.1.3 and broadcast address
10.255.255.255
B.A.T.M.A.N. 0.3.2-beta rv1256 (compatibility version 5)
[        30] Adding throw route to 127.0.0.0/8 via 0.0.0.0 (table 68 - lo)
[        30] Adding throw route to 10.1.3.0/24 via 0.0.0.0 (table 68 - eth1)
[        30] Adding throw route to 10.0.0.0/16 via 0.0.0.0 (table 68 - ath0)
debug level: 4
routing class: 2
[        30] Adding throw route to 10.1.3.0/24 via 0.0.0.0 (table 65 - unknown)
[        30] Adding throw route to 10.1.3.0/24 via 0.0.0.0 (table 66 - unknown)
[        30] Adding throw route to 10.1.3.0/24 via 0.0.0.0 (table 67 - unknown)
[        30] Adding throw route to 10.1.3.0/24 via 0.0.0.0 (table 68 - unknown)
[        30] Error - can't add throw route to 10.1.3.0/24 via 0.0.0.0
(table 68): File exists
[        30] Error - can't add throw route to 10.1.3.0/24 via 0.0.0.0
(table 68): File exists
[        30] debugRealloc - invalid magic number in trailer: 78183456,
malloc tag = 15
[        30] debugRealloc - invalid magic number in trailer: 78183456,
malloc tag = 15
[        30] Deleting throw route to 127.0.0.0/8 via 0.0.0.0 (table 68 - lo)
[       100] Deleting throw route to 10.1.3.0/24 via 0.0.0.0 (table 68 - eth1)
[       130] Deleting throw route to 10.0.0.0/16 via 0.0.0.0 (table 68 - ath0)
========================================

It does this while not connected.

>> This is only happening when using the batgat module.
>> The kernel is crashing.  If it happens to not reboot, I see that
>> batman is in a device wait state and can't be killed.
>
> The log you just provided is not about a kernel crash - it's "just" the batman
> daemon. Are we hunting 2 different bugs ?

If you consider 1 bug being the debug_malloc stuff not working, and
the other being batgat possibly crashing the kernel, then yes.
If I turn off debug malloc, then everything works fine, except using
batgat and going from gateway to routing class.


* Re: [B.A.T.M.A.N.] Kernel crashes with batgat installed
  2009-05-20 17:01           ` Nathan Wharton
@ 2009-05-20 19:02             ` Marek Lindner
  2009-05-20 19:39               ` Nathan Wharton
  0 siblings, 1 reply; 24+ messages in thread
From: Marek Lindner @ 2009-05-20 19:02 UTC (permalink / raw)
  To: The list for a Better Approach To Mobile Ad-hoc Networking

On Thursday 21 May 2009 01:01:43 Nathan Wharton wrote:
> In this case, it does the same thing whether or not batgat is installed.

Ok.

A couple of things are missing from your output. Do you use the plain sources 
from open-mesh.net or do you apply custom patches ?


> Debug level 4 gives:
> ========================================
> WARNING: You are using the unstable batman branch. If you are
> interested in *using* batman get the latest stable release !
> Deleting throw route to 10.1.3.0/24 via 0.0.0.0 (table 66 - unknown)
> Deleting throw route to 10.1.3.0/24 via 0.0.0.0 (table 66 - unknown)
> Deleting throw route to 10.1.3.0/24 via 0.0.0.0 (table 65 - unknown)
> Deleting throw route to 10.1.3.0/24 via 0.0.0.0 (table 65 - unknown)
> Deleting throw route to 10.1.3.0/24 via 0.0.0.0 (table 67 - unknown)
> Deleting throw route to 10.1.3.0/24 via 0.0.0.0 (table 67 - unknown)

Your log indicates that all routes are still present and batman tries to clean 
them up while starting. As you can see here table 68 is not mentioned. On my 
machine I get:

Deleting throw route to 105.0.0.0/8 via 0.0.0.0 (table 68 - unknown)


> [        30] Adding throw route to 10.1.3.0/24 via 0.0.0.0 (table 65)
> [        30] Adding throw route to 10.1.3.0/24 via 0.0.0.0 (table 66)

Here we lack the message that says we found a new HNA:
Adding HNA to announce network list: 105.0.0.0/8


> It does this while not connected.

I could make a patch that produces more debug output to get to the root of it 
but first we have to make sure we run the same code ...


> If you consider 1 bug being the debug_malloc stuff not working, and
> the other being batgat possibly crashing the kernel, then yes.
> If I turn off debug malloc, then everything works fine, except using
> batgat and going from gateway to routing class.

Ok, let's do the malloc stuff first and then we move to the batgat issue.

Just to be clear here: DEBUG_MALLOC is not the problem - it just makes the 
problem visible. Every time batman allocates memory, the debugger allocates 
more than needed so it can add its debugging information. Now that debugging 
information is getting overwritten, and the debugger tells you so (including a 
pointer towards the source of the problem). If you deactivate the debugger, 
the memory will still be overwritten but you won't notice it! 
The overwrite can destroy arbitrary structures in memory and may take hours to 
lead to a crash (if at all). Maybe it leads to broken routing entries ..
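
The mechanism described above can be sketched in a few lines of C. This is
a simplified illustration, not batman's allocate.c: the names guard_alloc,
guard_check and guard_free, the header layout and the magic value are all
assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_MAGIC 0x12345678u	/* hypothetical value for this sketch */

struct guard_header {
	uint32_t length;	/* size requested by the caller */
	uint32_t magic;
};

/* Allocates length bytes framed by a header and a trailing magic number. */
static void *guard_alloc(uint32_t length)
{
	const uint32_t magic = GUARD_MAGIC;
	struct guard_header hdr = { length, magic };
	unsigned char *raw = malloc(sizeof(hdr) + length + sizeof(magic));

	if (!raw)
		return NULL;

	memcpy(raw, &hdr, sizeof(hdr));
	/* memcpy avoids an unaligned store when length is not a multiple of 4 */
	memcpy(raw + sizeof(hdr) + length, &magic, sizeof(magic));
	return raw + sizeof(hdr);
}

/* Returns 1 if both magic numbers are intact, 0 if one was overwritten. */
static int guard_check(const void *user)
{
	const unsigned char *raw =
		(const unsigned char *)user - sizeof(struct guard_header);
	struct guard_header hdr;
	uint32_t trailer;

	memcpy(&hdr, raw, sizeof(hdr));
	if (hdr.magic != GUARD_MAGIC)
		return 0;
	memcpy(&trailer, raw + sizeof(hdr) + hdr.length, sizeof(trailer));
	return trailer == GUARD_MAGIC;
}

/* Releases a block obtained from guard_alloc (NULL is ignored). */
static void guard_free(void *user)
{
	if (user)
		free((unsigned char *)user - sizeof(struct guard_header));
}
```

A single byte written past the end of the user region changes the trailer,
so guard_check returns 0 for that block. Without the guard, the same write
would land silently in whatever happens to follow the allocation.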

Regards,
Marek




* Re: [B.A.T.M.A.N.] Kernel crashes with batgat installed
  2009-05-20 19:02             ` Marek Lindner
@ 2009-05-20 19:39               ` Nathan Wharton
  0 siblings, 0 replies; 24+ messages in thread
From: Nathan Wharton @ 2009-05-20 19:39 UTC (permalink / raw)
  To: The list for a Better Approach To Mobile Ad-hoc Networking

On Wed, May 20, 2009 at 2:02 PM, Marek Lindner <lindner_marek@yahoo.de> wrote:
> I miss a couple of things in your output - do you use the plain sources from
> open-mesh.net or do you apply custom patches ?

I am using OpenWRT, and it doesn't have any patches.  It does get the
source from open-mesh.net.

> Your log indicates that all routes are still present and batman tries to clean
> them up while starting. As you can see here table 68 is not mentioned.
> ....
> Here we lack the message that says we found a new HNA:
> Adding HNA to announce network list: 105.0.0.0/8
> ....
> I could make a patch that produces more debug output to get to the root of it
> but first we have to make sure we run the same code ...

I don't know where the table 68 entries might have gone, or the HNA.

How about using 1269?  Here is the latest -d 4 output:
=========================================
WARNING: You are using the unstable batman branch. If you are
interested in *using* batman get the latest stable release !
Deleting throw route to 10.255.1.3/32 via 0.0.0.0 (table 66 - unknown)
Deleting throw route to 10.255.1.3/32 via 0.0.0.0 (table 66 - unknown)
Deleting throw route to 10.1.3.0/24 via 0.0.0.0 (table 66 - unknown)
Deleting throw route to 10.1.3.0/24 via 0.0.0.0 (table 66 - unknown)
Deleting throw route to 10.255.1.3/32 via 0.0.0.0 (table 68 - unknown)
Deleting throw route to 10.255.1.3/32 via 0.0.0.0 (table 68 - unknown)
Deleting throw route to 10.255.1.3/32 via 0.0.0.0 (table 65 - unknown)
Deleting throw route to 10.255.1.3/32 via 0.0.0.0 (table 65 - unknown)
Deleting throw route to 10.1.3.0/24 via 0.0.0.0 (table 65 - unknown)
Deleting throw route to 10.1.3.0/24 via 0.0.0.0 (table 65 - unknown)
Deleting throw route to 10.255.1.3/32 via 0.0.0.0 (table 67 - unknown)
Deleting throw route to 10.255.1.3/32 via 0.0.0.0 (table 67 - unknown)
Deleting throw route to 10.1.3.0/24 via 0.0.0.0 (table 67 - unknown)
Deleting throw route to 10.1.3.0/24 via 0.0.0.0 (table 67 - unknown)
Interface activated: ath0
Using interface ath0 with address 10.0.1.3 and broadcast address 10.0.255.255
Interface activated: eth0
Using interface eth0 with address 10.255.1.3 and broadcast address
10.255.255.255
B.A.T.M.A.N. 0.3.2-beta rv1269 (compatibility version 5)
[        30] Adding throw route to 127.0.0.0/8 via 0.0.0.0 (table 68 - lo)
[        30] Adding throw route to 10.1.3.0/24 via 0.0.0.0 (table 68 - eth1)
[        30] Adding throw route to 10.0.0.0/16 via 0.0.0.0 (table 68 - ath0)
debug level: 4
routing class: 2
[        30] schedule_own_packet(): ath0
[        30] schedule_own_packet(): eth0
[        30]
[       940]
[       950] Sending own packet (originator 10.255.1.3, seqno 1, TTL
2) on interface eth0
[       950] schedule_own_packet(): eth0
[       950]
[       950] Received BATMAN packet via NB: 10.255.1.3, IF: eth0
10.255.1.3 (from OG: 10.255.1.3, via old OG: 10.255.1.3, seqno 1, tq
255, TTL 2, V 5, IDF 0)
[       950] Drop packet: received my own broadcast (sender: 10.255.1.3)
[       950]
[      1010]
[      1020] Sending own packet (originator 10.0.1.3, seqno 1, TTL 50,
IDF off) on interface ath0
[      1020] Sending own packet (originator 10.0.1.3, seqno 1, TTL 50,
IDF off) on interface eth0
[      1020] schedule_own_packet(): ath0
[      1020]
[      1020] Received BATMAN packet via NB: 10.0.1.3, IF: ath0
10.0.1.3 (from OG: 10.0.1.3, via old OG: 10.0.1.3, seqno 1, tq 255,
TTL 50, V 5, IDF 0)
[      1020] Drop packet: received my own broadcast (sender: 10.0.1.3)
[      1020]
[      1020] Received BATMAN packet via NB: 10.255.1.3, IF: eth0
10.255.1.3 (from OG: 10.0.1.3, via old OG: 10.0.1.3, seqno 1, tq 255,
TTL 50, V 5, IDF 0)
[      1020] Drop packet: received my own broadcast (sender: 10.255.1.3)
[      1020]
[      1990]
[      2000] Sending own packet (originator 10.255.1.3, seqno 2, TTL
2) on interface eth0
[      2000] schedule_own_packet(): eth0
[      2000] ------------------ DEBUG ------------------
[      2000] Forward list
[      2000]     10.0.1.3 at 2022
[      2000]     10.255.1.3 at 2913
[      2000] Originator list
[      2000]   Originator  (#/255)         Nexthop [outgoingIF]:
Potential nexthops
[      2000] No batman nodes in range ...
[      2000] ---------------------------------------------- END DEBUG
[      2000] Adding throw route to 10.1.3.0/24 via 0.0.0.0 (table 65 - unknown)
[      2000] Adding throw route to 10.1.3.0/24 via 0.0.0.0 (table 66 - unknown)
[      2000] Adding throw route to 10.1.3.0/24 via 0.0.0.0 (table 67 - unknown)
[      2000] Adding throw route to 10.1.3.0/24 via 0.0.0.0 (table 68 - unknown)
[      2000] Error - can't add throw route to 10.1.3.0/24 via 0.0.0.0
(table 68): File exists
[      2000] Error - can't add throw route to 10.1.3.0/24 via 0.0.0.0
(table 68): File exists
[      2000] Adding throw route to 10.255.1.3/32 via 0.0.0.0 (table 65
- unknown)
[      2000] Adding throw route to 10.255.1.3/32 via 0.0.0.0 (table 66
- unknown)
[      2000] Adding throw route to 10.255.1.3/32 via 0.0.0.0 (table 67
- unknown)
[      2000] Adding throw route to 10.255.1.3/32 via 0.0.0.0 (table 68
- unknown)
[      2010] debugRealloc - invalid magic number in trailer: 78183456,
malloc tag = 15
[      2010] debugRealloc - invalid magic number in trailer: 78183456,
malloc tag = 15
[      2010] Deleting throw route to 127.0.0.0/8 via 0.0.0.0 (table 68 - lo)
[      2090] Deleting throw route to 10.1.3.0/24 via 0.0.0.0 (table 68 - eth1)
[      2130] Deleting throw route to 10.0.0.0/16 via 0.0.0.0 (table 68 - ath0)
=========================================

> Ok, let's do the malloc stuff first and then we move to the batgat issue.
>
> Just to be clear here: DEBUG_MALLOC is not the problem - it just makes the
> problem visible. Every time batman allocates memory the debugger will allocate
> more than needed to add its debugging information. Now the debugging
> information gets overwritten and the debugger tells you that (including a
> direction towards the source of the problem). If you deactivate the debugger
> the memory will still be overwritten but you don't notice it!
> It can destroy arbitrary structures in the memory that need hours to lead to a
> crash (if at all). Maybe it leads to broken routing entries ..

That sounds good to me.  I had just turned it off to see if it was
just giving false errors, and everything ran fine until trying to do
something new with batgat.


* [B.A.T.M.A.N.] [PATCH] [batman] Add padding around allocation debugger structures
  2009-05-19 14:27 [B.A.T.M.A.N.] Kernel crashes with batgat installed Nathan Wharton
  2009-05-19 19:21 ` Sven Eckelmann
@ 2009-05-28 10:40 ` Sven Eckelmann
  2009-05-29  7:02   ` Marek Lindner
  2009-05-28 11:36 ` [B.A.T.M.A.N.] [PATCH 2/3] [batman] Make TYPE_OF_WORD the largest integral type Sven Eckelmann
  2009-05-28 11:36 ` [B.A.T.M.A.N.] [PATCH 3/3] [batman] Word-Align char buffer which are later casted to larger data types Sven Eckelmann
  3 siblings, 1 reply; 24+ messages in thread
From: Sven Eckelmann @ 2009-05-28 10:40 UTC (permalink / raw)
  To: b.a.t.m.a.n

Architectures that require aligned load and store operations for
datatypes bigger than a byte return a prealigned memory region from
malloc. When we add our data structures before and after this
region we destroy this alignment.
To fix this problem we add special regions with "magic" padding data.
To be sure that the padding is big enough for every load/store operation
we use the alignment of uintmax_t or a pointer, even when the
architecture only supports smaller load/store operations.

Signed-off-by: Sven Eckelmann <sven.eckelmann@gmx.de>
---
 batman/allocate.c |   77 +++++++++++++++++++++++++++++++++++++++++++++++-----
 1 files changed, 69 insertions(+), 8 deletions(-)

diff --git a/batman/allocate.c b/batman/allocate.c
index 3cb1d65..a779504 100644
--- a/batman/allocate.c
+++ b/batman/allocate.c
@@ -67,6 +67,44 @@ struct memoryUsage
 };
 
 
+static size_t getHeaderPad() {
+	size_t pad = sizeof(uintmax_t) - (sizeof(struct chunkHeader) % sizeof(uintmax_t));
+	if (pad == sizeof(uintmax_t))
+		return 0;
+	else
+		return pad;
+}
+
+static size_t getTrailerPad(size_t length) {
+	size_t pad = sizeof(uintmax_t) - (length % sizeof(uintmax_t));
+	if (pad == sizeof(uintmax_t))
+		return 0;
+	else
+		return pad;
+}
+
+static void fillPadding(unsigned char* padding, size_t length) {
+	unsigned char c = 0x00;
+	size_t i;
+
+	for (i = 0; i < length; i++) {
+		c += 0xA7;
+		padding[i] = c;
+	}
+}
+
+static int checkPadding(unsigned char* padding, size_t length) {
+	unsigned char c = 0x00;
+	size_t i;
+
+	for (i = 0; i < length; i++) {
+		c += 0xA7;
+		if (padding[i] != c)
+			return 0;
+	}
+	return 1;
+}
+
 static void addMemory( uint32_t length, int32_t tag ) {
 
 	struct memoryUsage *walker;
@@ -176,7 +214,7 @@ void checkIntegrity(void)
 
 		memory = (unsigned char *)walker;
 
-		chunkTrailer = (struct chunkTrailer *)(memory + sizeof(struct chunkHeader) + walker->length);
+		chunkTrailer = (struct chunkTrailer *)(memory + sizeof(struct chunkHeader) + getHeaderPad() + walker->length + getTrailerPad(walker->length));
 
 		if (chunkTrailer->magicNumber != MAGIC_NUMBER)
 		{
@@ -209,7 +247,7 @@ void *debugMalloc(uint32_t length, int32_t tag)
 
 /* 	printf("sizeof(struct chunkHeader) = %u, sizeof (struct chunkTrailer) = %u\n", sizeof (struct chunkHeader), sizeof (struct chunkTrailer)); */
 
-	memory = malloc(length + sizeof(struct chunkHeader) + sizeof(struct chunkTrailer));
+	memory = malloc(length + sizeof(struct chunkHeader) + sizeof(struct chunkTrailer) + getHeaderPad() + getTrailerPad(length));
 
 	if (memory == NULL)
 	{
@@ -218,8 +256,11 @@ void *debugMalloc(uint32_t length, int32_t tag)
 	}
 
 	chunkHeader = (struct chunkHeader *)memory;
-	chunk = memory + sizeof(struct chunkHeader);
-	chunkTrailer = (struct chunkTrailer *)(memory + sizeof(struct chunkHeader) + length);
+	chunk = memory + sizeof(struct chunkHeader) + getHeaderPad();
+	chunkTrailer = (struct chunkTrailer *)(memory + sizeof(struct chunkHeader) + length + getHeaderPad() + getTrailerPad(length));
+
+	fillPadding((unsigned char*)chunkHeader + sizeof(struct chunkHeader), getHeaderPad());
+	fillPadding(chunk + length, getTrailerPad(length));
 
 	chunkHeader->length = length;
 	chunkHeader->tag = tag;
@@ -251,7 +292,7 @@ void *debugRealloc(void *memoryParameter, uint32_t length, int32_t tag)
 
 	if (memoryParameter) { /* if memoryParameter==NULL, realloc() should work like malloc() !! */
 		memory = memoryParameter;
-		chunkHeader = (struct chunkHeader *)(memory - sizeof(struct chunkHeader));
+		chunkHeader = (struct chunkHeader *)(memory - sizeof(struct chunkHeader) - getHeaderPad());
 
 		if (chunkHeader->magicNumber != MAGIC_NUMBER)
 		{
@@ -259,13 +300,23 @@ void *debugRealloc(void *memoryParameter, uint32_t length, int32_t tag)
 			restore_and_exit(0);
 		}
 
-		chunkTrailer = (struct chunkTrailer *)(memory + chunkHeader->length);
+		if (checkPadding(memory - getHeaderPad(), getHeaderPad()) == 0) {
+			debug_output( 0, "debugRealloc - invalid magic padding in header, malloc tag = %d\n", chunkHeader->tag );
+			restore_and_exit(0);
+		}
+
+		chunkTrailer = (struct chunkTrailer *)(memory + chunkHeader->length + getTrailerPad(chunkHeader->length));
 
 		if (chunkTrailer->magicNumber != MAGIC_NUMBER)
 		{
 			debug_output( 0, "debugRealloc - invalid magic number in trailer: %08x, malloc tag = %d\n", chunkTrailer->magicNumber, chunkHeader->tag );
 			restore_and_exit(0);
 		}
+
+		if (checkPadding(memory + chunkHeader->length, getTrailerPad(chunkHeader->length)) == 0) {
+			debug_output( 0, "debugRealloc - invalid magic padding in trailer, malloc tag = %d\n", chunkHeader->tag );
+			restore_and_exit(0);
+		}
 	}
 
 
@@ -292,7 +343,7 @@ void debugFree(void *memoryParameter, int tag)
 	struct chunkHeader *previous;
 
 	memory = memoryParameter;
-	chunkHeader = (struct chunkHeader *)(memory - sizeof(struct chunkHeader));
+	chunkHeader = (struct chunkHeader *)(memory - sizeof(struct chunkHeader) - getHeaderPad());
 
 	if (chunkHeader->magicNumber != MAGIC_NUMBER)
 	{
@@ -300,6 +351,11 @@ void debugFree(void *memoryParameter, int tag)
 		restore_and_exit(0);
 	}
 
+	if (checkPadding(memory - getHeaderPad(), getHeaderPad()) == 0) {
+		debug_output( 0, "debugFree - invalid magic padding in header, malloc tag = %d\n", chunkHeader->tag );
+		restore_and_exit(0);
+	}
+
 	previous = NULL;
 
 	pthread_mutex_lock(&chunk_mutex);
@@ -326,7 +382,7 @@ void debugFree(void *memoryParameter, int tag)
 
 	pthread_mutex_unlock(&chunk_mutex);
 
-	chunkTrailer = (struct chunkTrailer *)(memory + chunkHeader->length);
+	chunkTrailer = (struct chunkTrailer *)(memory + chunkHeader->length + getTrailerPad(chunkHeader->length));
 
 	if (chunkTrailer->magicNumber != MAGIC_NUMBER)
 	{
@@ -334,6 +390,11 @@ void debugFree(void *memoryParameter, int tag)
 		restore_and_exit(0);
 	}
 
+	if (checkPadding(memory + chunkHeader->length, getTrailerPad(chunkHeader->length)) == 0) {
+		debug_output( 0, "debugFree - invalid magic padding in trailer, malloc tag = %d\n", chunkHeader->tag );
+		restore_and_exit(0);
+	}
+
 #if defined MEMORY_USAGE
 
 	removeMemory( chunkHeader->tag, tag );
-- 
1.6.3.1
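
The padding arithmetic used by getHeaderPad() and getTrailerPad() in the patch above can be condensed into one helper. The following is a sketch only; pad_for is a hypothetical name that does not appear in the patch, and the alignment is passed in as a parameter instead of being fixed to sizeof(uintmax_t).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of bytes that must follow `length` bytes so that the next
 * object starts at a multiple of `align`. Returns 0 when `length`
 * is already a multiple of `align`.
 */
static size_t pad_for(size_t length, size_t align)
{
	assert(align != 0);
	return (align - (length % align)) % align;
}
```

The trailing modulo replaces the "if (pad == sizeof(uintmax_t)) return 0" special case in the patch. Both forms should yield the same values.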


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [B.A.T.M.A.N.] [PATCH 2/3] [batman] Make TYPE_OF_WORD the largest integral type
  2009-05-19 14:27 [B.A.T.M.A.N.] Kernel crashes with batgat installed Nathan Wharton
  2009-05-19 19:21 ` Sven Eckelmann
  2009-05-28 10:40 ` [B.A.T.M.A.N.] [PATCH] [batman] Add padding around allocation debugger structures Sven Eckelmann
@ 2009-05-28 11:36 ` Sven Eckelmann
  2009-05-28 11:36 ` [B.A.T.M.A.N.] [PATCH 3/3] [batman] Word-Align char buffer which are later casted to larger data types Sven Eckelmann
  3 siblings, 0 replies; 24+ messages in thread
From: Sven Eckelmann @ 2009-05-28 11:36 UTC (permalink / raw)
  To: b.a.t.m.a.n


Signed-off-by: Sven Eckelmann <sven.eckelmann@gmx.de>
---
 batman/allocate.c |   24 ++++++++++++++++++++----
 batman/batman.h   |    2 ++
 batman/bitarray.h |    4 +---
 3 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/batman/allocate.c b/batman/allocate.c
index a779504..5e28c71 100644
--- a/batman/allocate.c
+++ b/batman/allocate.c
@@ -68,16 +68,32 @@ struct memoryUsage
 
 
 static size_t getHeaderPad() {
-	size_t pad = sizeof(uintmax_t) - (sizeof(struct chunkHeader) % sizeof(uintmax_t));
-	if (pad == sizeof(uintmax_t))
+	size_t alignwith, pad;
+
+	if (sizeof(TYPE_OF_WORD) > sizeof(void*))
+		alignwith = sizeof(TYPE_OF_WORD);
+	else
+		alignwith = sizeof(void*);
+
+	pad = alignwith - (sizeof(struct chunkHeader) % alignwith);
+
+	if (pad == alignwith)
 		return 0;
 	else
 		return pad;
 }
 
 static size_t getTrailerPad(size_t length) {
-	size_t pad = sizeof(uintmax_t) - (length % sizeof(uintmax_t));
-	if (pad == sizeof(uintmax_t))
+	size_t alignwith, pad;
+
+	if (sizeof(TYPE_OF_WORD) > sizeof(void*))
+		alignwith = sizeof(TYPE_OF_WORD);
+	else
+		alignwith = sizeof(void*);
+
+	pad = alignwith - (length % alignwith);
+
+	if (pad == alignwith)
 		return 0;
 	else
 		return pad;
diff --git a/batman/batman.h b/batman/batman.h
index c02ce8d..1cc5896 100644
--- a/batman/batman.h
+++ b/batman/batman.h
@@ -31,6 +31,8 @@
 #include <stdint.h>
 #include <stdio.h>
 
+#define TYPE_OF_WORD uintmax_t /* you should choose something big, if you don't want to waste cpu */
+
 #include "list-batman.h"
 #include "bitarray.h"
 #include "hash.h"
diff --git a/batman/bitarray.h b/batman/bitarray.h
index 5472ef1..0bb0710 100644
--- a/batman/bitarray.h
+++ b/batman/bitarray.h
@@ -21,10 +21,8 @@
 
 
 
-#define TYPE_OF_WORD unsigned long /* you should choose something big, if you don't want to waste cpu */
-#define WORD_BIT_SIZE ( sizeof(TYPE_OF_WORD) * 8 )
 #include "batman.h"
-
+#define WORD_BIT_SIZE ( sizeof(TYPE_OF_WORD) * 8 )
 
 
 void bit_init( TYPE_OF_WORD *seq_bits );
-- 
1.6.3.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [B.A.T.M.A.N.] [PATCH 3/3] [batman] Word-Align char buffer which are later casted to larger data types
  2009-05-19 14:27 [B.A.T.M.A.N.] Kernel crashes with batgat installed Nathan Wharton
                   ` (2 preceding siblings ...)
  2009-05-28 11:36 ` [B.A.T.M.A.N.] [PATCH 2/3] [batman] Make TYPE_OF_WORD the largest integral type Sven Eckelmann
@ 2009-05-28 11:36 ` Sven Eckelmann
  3 siblings, 0 replies; 24+ messages in thread
From: Sven Eckelmann @ 2009-05-28 11:36 UTC (permalink / raw)
  To: b.a.t.m.a.n

Buffers of char are not necessarily aligned on all architectures. If
such a buffer is later cast to a larger data type, the compiler does
not know about the missing alignment and generates unsafe instructions,
as it assumes that the data is word aligned.

Signed-off-by: Sven Eckelmann <sven.eckelmann@gmx.de>
---
 batman/batman.h       |    1 +
 batman/linux/route.c  |    6 +++---
 batman/posix/tunnel.c |    2 +-
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/batman/batman.h b/batman/batman.h
index 1cc5896..d6b00cf 100644
--- a/batman/batman.h
+++ b/batman/batman.h
@@ -152,6 +152,7 @@
 
 
 #define BATMANUNUSED(x) (x)__attribute__((unused))
+#define ALIGN_WORD __attribute__ ((aligned(sizeof(TYPE_OF_WORD))))
 
 
 
diff --git a/batman/linux/route.c b/batman/linux/route.c
index 4d46955..0c7b932 100644
--- a/batman/linux/route.c
+++ b/batman/linux/route.c
@@ -185,7 +185,7 @@ void add_del_route(uint32_t dest, uint8_t netmask, uint32_t router, uint32_t src
 		struct rtmsg rtm;
 		char buff[4 * (sizeof(struct rtattr) + 4)];
 	} *req;
-	char req_buf[NLMSG_LENGTH(sizeof(struct req_s))];
+	char req_buf[NLMSG_LENGTH(sizeof(struct req_s))] ALIGN_WORD;
 
 	iov.iov_base = buf;
 	iov.iov_len  = sizeof(buf);
@@ -369,7 +369,7 @@ void add_del_rule(uint32_t network, uint8_t netmask, int8_t rt_table, uint32_t p
 		struct rtmsg rtm;
 		char buff[2 * (sizeof(struct rtattr) + 4)];
 	} *req;
-	char req_buf[NLMSG_LENGTH(sizeof(struct req_s))];
+	char req_buf[NLMSG_LENGTH(sizeof(struct req_s))] ALIGN_WORD;
 
 	iov.iov_base = buf;
 	iov.iov_len  = sizeof(buf);
@@ -634,7 +634,7 @@ int flush_routes_rules(int8_t is_rule)
 	struct req_s {
 		struct rtmsg rtm;
 	} *req;
-	char req_buf[NLMSG_LENGTH(sizeof(struct req_s))];
+	char req_buf[NLMSG_LENGTH(sizeof(struct req_s))] ALIGN_WORD;
 
 	struct rtattr *rtap;
 
diff --git a/batman/posix/tunnel.c b/batman/posix/tunnel.c
index 4263794..1cfb501 100644
--- a/batman/posix/tunnel.c
+++ b/batman/posix/tunnel.c
@@ -567,7 +567,7 @@ void *gw_listen(void *BATMANUNUSED(arg)) {
 	unsigned char buff[1501];
 	int32_t res, max_sock, buff_len, tun_fd, tun_ifi;
 	uint32_t addr_len, client_timeout, current_time;
-	uint8_t my_tun_ip[4], next_free_ip[4];
+	uint8_t my_tun_ip[4] ALIGN_WORD, next_free_ip[4] ALIGN_WORD;
 	struct hashtable_t *wip_hash, *vip_hash;
 	struct list_head_first free_ip_list;
 	fd_set wait_sockets, tmp_wait_sockets;
-- 
1.6.3.1
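
As a rough illustration of what ALIGN_WORD does for a local char buffer: the sketch below assumes GCC or a compatible compiler, and ALIGN_AS, is_aligned and aligned_buffer_ok are hypothetical names, not part of batman. It only checks the address of the buffer and does not perform the later cast.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same idea as ALIGN_WORD in the patch, with the type as a parameter. */
#define ALIGN_AS(type) __attribute__((aligned(sizeof(type))))

/* Return 1 if `p` is a multiple of `align`, else 0. */
static int is_aligned(const void *p, size_t align)
{
	return ((uintptr_t)p % align) == 0;
}

/* A char buffer with forced alignment starts on a word boundary. */
static int aligned_buffer_ok(void)
{
	char buf[64] ALIGN_AS(uintmax_t);

	buf[0] = 0;
	return is_aligned(buf, sizeof(uintmax_t));
}
```

Without the attribute a char array only has to satisfy the alignment of char, so its start address may be odd.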


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [B.A.T.M.A.N.] [PATCH] [batman] Add padding around allocation debugger structures
  2009-05-28 10:40 ` [B.A.T.M.A.N.] [PATCH] [batman] Add padding around allocation debugger structures Sven Eckelmann
@ 2009-05-29  7:02   ` Marek Lindner
  2009-05-29 14:00     ` Nathan Wharton
  0 siblings, 1 reply; 24+ messages in thread
From: Marek Lindner @ 2009-05-29  7:02 UTC (permalink / raw)
  To: The list for a Better Approach To Mobile Ad-hoc Networking

On Thursday 28 May 2009 18:40:08 Sven Eckelmann wrote:
> Architectures with a special alignment for load and store operations on
> datatypes bigger than bytes will return a prealigned memory region when
> calling malloc. When we add our data structure before and after this
> region we destroy this alignment.
> To fix this problem we add special regions with "magic" padding data.
> To be sure that it is big enough for every load/store operation we use
> the alignment for uintmax_t or a pointer even when the architecture only
> supports smaller load/store operations.
>
> Signed-off-by: Sven Eckelmann <sven.eckelmann@gmx.de>

@Nathan: Could you let me know if these patches work for you ? If so I'll 
commit them. 

Regards,
Marek


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [B.A.T.M.A.N.] [PATCH] [batman] Add padding around allocation debugger structures
  2009-05-29  7:02   ` Marek Lindner
@ 2009-05-29 14:00     ` Nathan Wharton
  2009-06-01 16:44       ` Sven Eckelmann
  0 siblings, 1 reply; 24+ messages in thread
From: Nathan Wharton @ 2009-05-29 14:00 UTC (permalink / raw)
  To: Marek Lindner; +Cc: The list for a Better Approach To Mobile Ad-hoc Networking

On Fri, May 29, 2009 at 2:02 AM, Marek Lindner <lindner_marek@yahoo.de> wrote:
> On Thursday 28 May 2009 18:40:08 Sven Eckelmann wrote:
>> Architectures with a special alignment for load and store operations on
>> datatypes bigger than bytes will return a prealigned memory region when
>> calling malloc. When we add our data structure before and after this
>> region we destroy this alignment.
>> To fix this problem we add special regions with "magic" padding data.
>> To be sure that it is big enough for every load/store operation we use
>> the alignment for uintmax_t or a pointer even when the architecture only
>> supports smaller load/store operations.
>>
>> Signed-off-by: Sven Eckelmann <sven.eckelmann@gmx.de>
>
> @Nathan: Could you let me know if these patches work for you ? If so I'll
> commit them.
>
> Regards,
> Marek
>
>

I set /proc/cpu/alignment to 4 (raise bus error) and I get a bus error:

Program received signal SIGBUS, Bus error.
list_add_tail (new=0x29368, head=0x28819) at list-batman.c:68
68              __list_add( new, head->prev, (struct list_head *)head );
(gdb) l
63       * Insert a new entry before the specified head.
64       * This is useful for implementing queues.
65       */
66      void list_add_tail( struct list_head *new, struct
list_head_first *head ) {
67
68              __list_add( new, head->prev, (struct list_head *)head );
69
70              head->prev = new;
71
72      }

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [B.A.T.M.A.N.] [PATCH] [batman] Add padding around allocation debugger structures
  2009-05-29 14:00     ` Nathan Wharton
@ 2009-06-01 16:44       ` Sven Eckelmann
  2009-06-01 18:03         ` Nathan Wharton
  0 siblings, 1 reply; 24+ messages in thread
From: Sven Eckelmann @ 2009-06-01 16:44 UTC (permalink / raw)
  To: b.a.t.m.a.n

On Friday 29 May 2009 16:00:40 Nathan Wharton wrote:
> > @Nathan: Could you let me know if these patches work for you ? If so I'll
> > commit them.
> >
> > Regards,
> > Marek
>
> I set /proc/cpu/alignment to 4 (raise bus error) and I get a bus error:
>
> Program received signal SIGBUS, Bus error.
> list_add_tail (new=0x29368, head=0x28819) at list-batman.c:68
> 68              __list_add( new, head->prev, (struct list_head *)head );
> (gdb) l
> 63       * Insert a new entry before the specified head.
> 64       * This is useful for implementing queues.
> 65       */
> 66      void list_add_tail( struct list_head *new, struct
> list_head_first *head ) {
> 67
> 68              __list_add( new, head->prev, (struct list_head *)head );
> 69
> 70              head->prev = new;
> 71
> 72      }
Did you apply the patches by hand? At the moment none of my patches are
available in trunk. Since you ran it under gdb, could you please attach a full
backtrace?

Best regards,
	Sven

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [B.A.T.M.A.N.] [PATCH] [batman] Add padding around allocation debugger structures
  2009-06-01 16:44       ` Sven Eckelmann
@ 2009-06-01 18:03         ` Nathan Wharton
  2009-06-01 19:35           ` Sven Eckelmann
  0 siblings, 1 reply; 24+ messages in thread
From: Nathan Wharton @ 2009-06-01 18:03 UTC (permalink / raw)
  To: Sven Eckelmann; +Cc: b.a.t.m.a.n

On Mon, Jun 1, 2009 at 11:44 AM, Sven Eckelmann <sven.eckelmann@gmx.de> wrote:
> On Friday 29 May 2009 16:00:40 Nathan Wharton wrote:
>> > @Nathan: Could you let me know if these patches work for you ? If so I'll
>> > commit them.
>> >
>> > Regards,
>> > Marek
>>
>> I set /proc/cpu/alignment to 4 (raise bus error) and I get a bus error:
>>
>> Program received signal SIGBUS, Bus error.
>> list_add_tail (new=0x29368, head=0x28819) at list-batman.c:68
>> 68              __list_add( new, head->prev, (struct list_head *)head );
>> (gdb) l
>> 63       * Insert a new entry before the specified head.
>> 64       * This is useful for implementing queues.
>> 65       */
>> 66      void list_add_tail( struct list_head *new, struct
>> list_head_first *head ) {
>> 67
>> 68              __list_add( new, head->prev, (struct list_head *)head );
>> 69
>> 70              head->prev = new;
>> 71
>> 72      }
> Have you added the patches per hand? At this moment no patch I've made
> available in trunk. As you have run it with gdb, can you please append a full
> backtrace?
>
> Best regards,
>        Sven
>

I had to copy the patches out of the e-mail.

Here is the back trace:
#0  list_add_tail (new=0x29bf0, head=0x298c9) at list-batman.c:68
#1  0x0000ee7c in _hna_global_add (orig_node=0x29f80,
hna_element=0x29ba8) at hna.c:371
#2  0x0000f160 in hna_global_add (orig_node=0x29f80, new_hna=<value
optimized out>, new_hna_len=<value optimized out>)
    at hna.c:529
#3  0x000099c8 in update_routes (orig_node=0x29f80,
neigh_node=0x2a080, hna_recv_buff=0xbead1591 "\n\002\001",
    hna_buff_len=10) at batman.c:377
#4  0x0000c730 in update_orig (orig_node=0x29f80, in=0xbead157f,
neigh=167772673, if_incoming=0x27678,
    hna_recv_buff=0xbead1591 "\n\002\001", hna_buff_len=-16723,
is_duplicate=0 '\0', curr_time=3199014207)
    at originator.c:227
#5  0x0000a7e0 in batman () at batman.c:956
#6  0x000148d4 in main (argc=14, argv=0xbead1e14) at posix/posix.c:629

Looks like debugMalloc didn't return an aligned value for head.  I'll
step through that and see what I see.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [B.A.T.M.A.N.] [PATCH] [batman] Add padding around allocation debugger structures
  2009-06-01 18:03         ` Nathan Wharton
@ 2009-06-01 19:35           ` Sven Eckelmann
  2009-06-01 21:50             ` Nathan Wharton
  2009-06-02  4:36             ` Marek Lindner
  0 siblings, 2 replies; 24+ messages in thread
From: Sven Eckelmann @ 2009-06-01 19:35 UTC (permalink / raw)
  To: Nathan Wharton; +Cc: b.a.t.m.a.n, Marek Lindner

On Monday 01 June 2009 20:03:43 Nathan Wharton wrote:
> I had to copy the patches out of the e-mail.
>
> Here is the back trace:
> #0  list_add_tail (new=0x29bf0, head=0x298c9) at list-batman.c:68
> #1  0x0000ee7c in _hna_global_add (orig_node=0x29f80,
> hna_element=0x29ba8) at hna.c:371
> #2  0x0000f160 in hna_global_add (orig_node=0x29f80, new_hna=<value
> optimized out>, new_hna_len=<value optimized out>)
>     at hna.c:529
> #3  0x000099c8 in update_routes (orig_node=0x29f80,
> neigh_node=0x2a080, hna_recv_buff=0xbead1591 "\n\002\001",
>     hna_buff_len=10) at batman.c:377
> #4  0x0000c730 in update_orig (orig_node=0x29f80, in=0xbead157f,
> neigh=167772673, if_incoming=0x27678,
>     hna_recv_buff=0xbead1591 "\n\002\001", hna_buff_len=-16723,
> is_duplicate=0 '\0', curr_time=3199014207)
>     at originator.c:227
> #5  0x0000a7e0 in batman () at batman.c:956
> #6  0x000148d4 in main (argc=14, argv=0xbead1e14) at posix/posix.c:629
>
> Looks like debugMalloc didn't return an aligned value for head.  I'll
> step through that and see what I see.
Ok, I think I see the problem. The malloc returned a correctly aligned address.
list_add_tail gets a pointer to a member of hna_global_entry. This
structure is packed, so all operations on it must be safe for unaligned data. If
you look at it further you will notice that orig_list is at offset 9
(assuming 4 bytes for a pointer), which of course is not aligned to 4 bytes.
And here comes the problem: the compiler only emits the safe operations for
unaligned data if it knows that the data is not aligned. Since a cast is done when
calling list_add_tail, it does not know that this parameter is unaligned, and
the alignment fault occurs.

So my question to Marek: Is it really necessary to have "struct hna_global_entry"
packed in hna.h:57? If not, we should remove the attribute and this problem should be
gone. And what about "struct hna_element"?

Thank you for your work, Nathan :)

Regards,
	Sven
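
The offset mentioned above can be checked with offsetof. This is a sketch assuming GCC or a compatible compiler; packed_entry and list_head_sketch are stand-ins that mirror the layout of hna_global_entry and list_head_first, not the real declarations from hna.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for batman's list_head_first: two pointers. */
struct list_head_sketch {
	struct list_head_sketch *next, *prev;
};

/* Same member order as hna_global_entry, packed without padding. */
struct packed_entry {
	uint32_t addr;
	uint8_t netmask;
	void *curr_orig_node;
	struct list_head_sketch orig_list;
} __attribute__((packed));

/* Offset of the list head inside the packed struct. */
static size_t orig_list_offset(void)
{
	return offsetof(struct packed_entry, orig_list);
}
```

With 4-byte pointers this gives 4 + 1 + 4 = 9, which matches the head=0x298c9 seen in the backtrace. With 8-byte pointers it gives 13. Neither is a multiple of the pointer size.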

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [B.A.T.M.A.N.] [PATCH] [batman] Add padding around allocation debugger structures
  2009-06-01 19:35           ` Sven Eckelmann
@ 2009-06-01 21:50             ` Nathan Wharton
  2009-06-02  4:36             ` Marek Lindner
  1 sibling, 0 replies; 24+ messages in thread
From: Nathan Wharton @ 2009-06-01 21:50 UTC (permalink / raw)
  To: Sven Eckelmann; +Cc: b.a.t.m.a.n, Marek Lindner

On Mon, Jun 1, 2009 at 2:35 PM, Sven Eckelmann <sven.eckelmann@gmx.de> wrote:
> Ok, I think I see the problem. The malloc returned a valid aligned adress.
> list_add_tail will get a pointer to an element in hna_global_entry. This
> structure is packed and all operations on it should be non-alignment safe. If
> you look at it further you will notice that orig_list is at position 9
> (assuming 4 bytes for a pointer) - which will not be aligned to 4 bytes of
> course.....
> And here comes the problem: the compiler will only do the safe operations on
> non-aligned data if it knows that it is not alignent. Since a cast is done by
> calling list_add_tail it will not know that this parameter is not aligned and
> the non-alignment bug will occur.
>
> So my question to marek: Is it really needed to have "struct hna_global_entry"
> packed in hna.h:57? If not then we should remove it and this problem should be
> gone. And what is with "struct hna_element".
>
> Thank you for your work, Nathan :)

You are welcome, thanks for your help.

The crashing of batgat on unloading turns out to be socket 4306 not
being ready to be reused yet.
I worked around this by using:
batmand -c -r 0 ; sleep 1 ; batmand -c -g 11000
and
batmand -c -g 0 ; sleep 1 ; batmand -c -r 2

Marek helped figure that out on irc.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [B.A.T.M.A.N.] [PATCH] [batman] Add padding around allocation debugger structures
  2009-06-01 19:35           ` Sven Eckelmann
  2009-06-01 21:50             ` Nathan Wharton
@ 2009-06-02  4:36             ` Marek Lindner
  2009-06-02 17:50               ` [B.A.T.M.A.N.] " Sven Eckelmann
  1 sibling, 1 reply; 24+ messages in thread
From: Marek Lindner @ 2009-06-02  4:36 UTC (permalink / raw)
  To: b.a.t.m.a.n

On Tuesday 02 June 2009 03:35:07 Sven Eckelmann wrote:
> So my question to marek: Is it really needed to have "struct
> hna_global_entry" packed in hna.h:57? If not then we should remove it and
> this problem should be gone. And what is with "struct hna_element".

The first 5 bytes of both structs are used as base for the hash index. If the 
compiler changes the order or something similar it might not work.

Regards,
Marek


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [B.A.T.M.A.N.] Add padding around allocation debugger structures
  2009-06-02  4:36             ` Marek Lindner
@ 2009-06-02 17:50               ` Sven Eckelmann
  2009-06-02 17:56                 ` [B.A.T.M.A.N.] [PATCH] [batman] Align pointers in hna list elements Sven Eckelmann
  0 siblings, 1 reply; 24+ messages in thread
From: Sven Eckelmann @ 2009-06-02 17:50 UTC (permalink / raw)
  To: b.a.t.m.a.n; +Cc: Marek Lindner

On Tuesday 02 June 2009 06:36:41 Marek Lindner wrote:
> On Tuesday 02 June 2009 03:35:07 Sven Eckelmann wrote:
> > So my question to marek: Is it really needed to have "struct
> > hna_global_entry" packed in hna.h:57? If not then we should remove it and
> > this problem should be gone. And what is with "struct hna_element".
>
> The first 5 bytes of both structs are used as base for the hash index. If
> the compiler changes the order or something similar it might not work.
Ok, then it should be safe to force the alignment of the pointers in 
hna_global_entry. Everything else seems to be much more complicated and 
doesn't create much cleaner code.

Best regards,
	Sven

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [B.A.T.M.A.N.] [PATCH] [batman] Align pointers in hna list elements
  2009-06-02 17:50               ` [B.A.T.M.A.N.] " Sven Eckelmann
@ 2009-06-02 17:56                 ` Sven Eckelmann
  2009-06-02 18:56                   ` Nathan Wharton
  2009-06-03 10:39                   ` [B.A.T.M.A.N.] [PATCHv2] " Sven Eckelmann
  0 siblings, 2 replies; 24+ messages in thread
From: Sven Eckelmann @ 2009-06-02 17:56 UTC (permalink / raw)
  To: b.a.t.m.a.n

Architectures like SuperARM or Xscale need aligned data for multi-byte
operations. GCC can create instruction sequences for packed data, but
it must know that something will not be aligned. Since list_add
operates on untyped data through void pointers, it cannot know that
hna_global_entry is packed and will only create the fast and unsafe
version of the load and store operations.
Since only the first 5 bytes of hna_global_entry need to be packed,
we can force the remaining members to be aligned without changing
the relative addresses of the first bytes.

Signed-off-by: Sven Eckelmann <sven.eckelmann@gmx.de>
---
 batman/hna.h |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/batman/hna.h b/batman/hna.h
index 6063324..3e7049e 100644
--- a/batman/hna.h
+++ b/batman/hna.h
@@ -58,8 +58,8 @@ struct hna_global_entry
 {
 	uint32_t addr;
 	uint8_t netmask;
-	struct orig_node *curr_orig_node;
-	struct list_head_first orig_list;
+	struct orig_node *curr_orig_node ALIGN_WORD;
+	struct list_head_first orig_list ALIGN_WORD;
 } __attribute__((packed));
 
 struct hna_orig_ptr
-- 
1.6.3.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [B.A.T.M.A.N.] [PATCH] [batman] Align pointers in hna list elements
  2009-06-02 17:56                 ` [B.A.T.M.A.N.] [PATCH] [batman] Align pointers in hna list elements Sven Eckelmann
@ 2009-06-02 18:56                   ` Nathan Wharton
  2009-06-03 10:39                   ` [B.A.T.M.A.N.] [PATCHv2] " Sven Eckelmann
  1 sibling, 0 replies; 24+ messages in thread
From: Nathan Wharton @ 2009-06-02 18:56 UTC (permalink / raw)
  To: The list for a Better Approach To Mobile Ad-hoc Networking

On Tue, Jun 2, 2009 at 12:56 PM, Sven Eckelmann <sven.eckelmann@gmx.de> wrote:
> Architectures like SuperARM or Xscale needs aligned data for multi-byte
> operations. GCC can create instructions sequences for packed data, but
> must know that something will not be aligned. Since list_add will
> operate on untyped data over void-pointers it cannot know that
> hna_global_entry is packed and will create only a fast and unsafe
> version for load and store operations.
> It is only important for the first 5 bytes of hna_global_entry to be
> packed we can force these elements to be aligned without changing
> the relative addresses of the first bytes.
>

It looks good here.  I am running this combined with the previous 3
patches with cpu/alignment set to bus error on problems.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [B.A.T.M.A.N.] [PATCHv2] [batman] Align pointers in hna list elements
  2009-06-02 17:56                 ` [B.A.T.M.A.N.] [PATCH] [batman] Align pointers in hna list elements Sven Eckelmann
  2009-06-02 18:56                   ` Nathan Wharton
@ 2009-06-03 10:39                   ` Sven Eckelmann
  2009-06-03 11:16                     ` Marek Lindner
  1 sibling, 1 reply; 24+ messages in thread
From: Sven Eckelmann @ 2009-06-03 10:39 UTC (permalink / raw)
  To: b.a.t.m.a.n

Architectures like SuperARM or Xscale need aligned data for multi-byte
operations. GCC can create instruction sequences for packed data, but
it must know that something will not be aligned. Since list_add
operates on untyped data through void pointers, it cannot know that
hna_global_entry is packed and will only create the fast and unsafe
version of the load and store operations.
Since only the first 5 bytes of hna_global_entry need to be packed,
we can force the remaining members to be aligned without changing
the relative addresses of the first bytes.

Signed-off-by: Sven Eckelmann <sven.eckelmann@gmx.de>
---
 batman/batman.h |    1 +
 batman/hna.h    |    4 ++--
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/batman/batman.h b/batman/batman.h
index d6b00cf..23f8e9a 100644
--- a/batman/batman.h
+++ b/batman/batman.h
@@ -153,6 +153,7 @@
 
 #define BATMANUNUSED(x) (x)__attribute__((unused))
 #define ALIGN_WORD __attribute__ ((aligned(sizeof(TYPE_OF_WORD))))
+#define ALIGN_POINTER __attribute__ ((aligned(sizeof(void*))))
 
 
 
diff --git a/batman/hna.h b/batman/hna.h
index 6063324..a046857 100644
--- a/batman/hna.h
+++ b/batman/hna.h
@@ -58,8 +58,8 @@ struct hna_global_entry
 {
 	uint32_t addr;
 	uint8_t netmask;
-	struct orig_node *curr_orig_node;
-	struct list_head_first orig_list;
+	struct orig_node *curr_orig_node ALIGN_POINTER;
+	struct list_head_first orig_list ALIGN_POINTER;
 } __attribute__((packed));
 
 struct hna_orig_ptr
-- 
1.6.3.1
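
The effect of the member attributes in this patch can be sketched as follows, assuming GCC or a compatible compiler. aligned_entry and list_head_sketch are illustrative stand-ins for hna_global_entry and list_head_first; only ALIGN_POINTER is taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ALIGN_POINTER __attribute__((aligned(sizeof(void *))))

/* Stand-in for batman's list_head_first: two pointers. */
struct list_head_sketch {
	struct list_head_sketch *next, *prev;
};

/* Packed struct whose pointer members are forced back onto boundaries. */
struct aligned_entry {
	uint32_t addr;
	uint8_t netmask;
	void *curr_orig_node ALIGN_POINTER;
	struct list_head_sketch orig_list ALIGN_POINTER;
} __attribute__((packed));

/* The 5 bytes used for hashing keep their offsets. */
static int hash_key_layout_unchanged(void)
{
	return offsetof(struct aligned_entry, addr) == 0 &&
	       offsetof(struct aligned_entry, netmask) == 4;
}

/* Both pointer-carrying members start on a pointer-size boundary. */
static int pointer_members_aligned(void)
{
	return offsetof(struct aligned_entry, curr_orig_node) % sizeof(void *) == 0 &&
	       offsetof(struct aligned_entry, orig_list) % sizeof(void *) == 0;
}
```

The compiler inserts padding after netmask, so the struct grows by a few bytes, but addr and netmask stay at offsets 0 and 4.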


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [B.A.T.M.A.N.] [PATCHv2] [batman] Align pointers in hna list elements
  2009-06-03 10:39                   ` [B.A.T.M.A.N.] [PATCHv2] " Sven Eckelmann
@ 2009-06-03 11:16                     ` Marek Lindner
  0 siblings, 0 replies; 24+ messages in thread
From: Marek Lindner @ 2009-06-03 11:16 UTC (permalink / raw)
  To: The list for a Better Approach To Mobile Ad-hoc Networking

On Wednesday 03 June 2009 18:39:26 Sven Eckelmann wrote:
> Architectures like SuperARM or Xscale needs aligned data for multi-byte
> operations. GCC can create instructions sequences for packed data, but
> must know that something will not be aligned. Since list_add will
> operate on untyped data over void-pointers it cannot know that
> hna_global_entry is packed and will create only a fast and unsafe
> version for load and store operations.
> It is only important for the first 5 bytes of hna_global_entry to be
> packed we can force these elements to be aligned without changing
> the relative addresses of the first bytes.


Sven, thanks a lot for your patches and thanks for your debugging help, 
Nathan. I just applied these patches. :-)

Regards,
Marek


^ permalink raw reply	[flat|nested] 24+ messages in thread

Thread overview: 24+ messages
2009-05-19 14:27 [B.A.T.M.A.N.] Kernel crashes with batgat installed Nathan Wharton
2009-05-19 19:21 ` Sven Eckelmann
2009-05-19 20:38   ` Nathan Wharton
2009-05-20  1:30     ` Marek Lindner
2009-05-20 14:34       ` Nathan Wharton
2009-05-20 16:10         ` Marek Lindner
2009-05-20 17:01           ` Nathan Wharton
2009-05-20 19:02             ` Marek Lindner
2009-05-20 19:39               ` Nathan Wharton
2009-05-28 10:40 ` [B.A.T.M.A.N.] [PATCH] [batman] Add padding around allocation debugger structures Sven Eckelmann
2009-05-29  7:02   ` Marek Lindner
2009-05-29 14:00     ` Nathan Wharton
2009-06-01 16:44       ` Sven Eckelmann
2009-06-01 18:03         ` Nathan Wharton
2009-06-01 19:35           ` Sven Eckelmann
2009-06-01 21:50             ` Nathan Wharton
2009-06-02  4:36             ` Marek Lindner
2009-06-02 17:50               ` [B.A.T.M.A.N.] " Sven Eckelmann
2009-06-02 17:56                 ` [B.A.T.M.A.N.] [PATCH] [batman] Align pointers in hna list elements Sven Eckelmann
2009-06-02 18:56                   ` Nathan Wharton
2009-06-03 10:39                   ` [B.A.T.M.A.N.] [PATCHv2] " Sven Eckelmann
2009-06-03 11:16                     ` Marek Lindner
2009-05-28 11:36 ` [B.A.T.M.A.N.] [PATCH 2/3] [batman] Make TYPE_OF_WORD the largest integral type Sven Eckelmann
2009-05-28 11:36 ` [B.A.T.M.A.N.] [PATCH 3/3] [batman] Word-Align char buffer which are later casted to larger data types Sven Eckelmann
