[klibc] Bug#511959: klibc-utils: ipconfig times out when several machines boot at the very same time

Cyril Brulebois cyril.brulebois at kerlabs.com
Thu Jan 15 14:19:19 PST 2009


Package: klibc-utils
Version: 1.5.12-2
Severity: important
Tags: patch

(I'm using X-Debbugs-Cc so that the klibc list receives a copy directly.
 I'd be glad to be kept in Cc since I don't follow that list, thanks
 already.)

Hi maks, hpa, Louis, and others,

For a while now, ipconfig has been timing out during DHCP when several
machines start up at the very same time (say, 2 out of 4 don't boot).
FWIW, we use the 'reboot' mechanism of our SSI system, which means the
reboots happen in sync.

I finally found some time to look into it, and discovered that the
state machine doesn't take some cases into account. First, I rebuilt
ipconfig with DEBUG/IPC_DEBUG, and saw that the machines that don't
boot receive the messages broadcast by the other DHCP clients (be it
DISCOVER or REQUEST) and, from that point on, receive nothing else.

Now, looking at the code (under usr/kinit/ipconfig for those following
at home):

packet.c: packet_recv()
|
|         if (udp->source != htons(cfg_remote_port) ||
|             udp->dest != htons(cfg_local_port))
|                 goto free_pkt;
|
| free_pkt:
|         free(ip);
|         return 0;
|
Which means in case of source/dest mismatch (which is the case when a
message from another client is received), 0 is returned.
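
To make the mismatch concrete, here is a minimal, self-contained sketch
of that port filter. It is not ipconfig's code: the helper names
ports_match() and ports_match_selfcheck() are made up for illustration,
and the port numbers are assumed to be the standard BOOTP/DHCP ones
(server 67, client 68) rather than read from cfg_remote_port and
cfg_local_port.

```c
#include <arpa/inet.h>  /* htons() */
#include <assert.h>
#include <stdint.h>

/* Assumed values: the standard BOOTP/DHCP ports, not taken from ipconfig. */
enum { SERVER_PORT = 67, CLIENT_PORT = 68 };

/*
 * Hypothetical helper mirroring the test quoted from packet_recv().
 * Both arguments are in network byte order, as in a UDP header.
 * Returns 1 if the datagram looks like a server reply sent to a client,
 * 0 otherwise.
 */
static int ports_match(uint16_t source_be, uint16_t dest_be)
{
	return source_be == htons(SERVER_PORT) &&
	       dest_be == htons(CLIENT_PORT);
}

/* Quick self-check of the two cases discussed in the text. */
static void ports_match_selfcheck(void)
{
	/* Server reply: 67 -> 68, accepted. */
	assert(ports_match(htons(SERVER_PORT), htons(CLIENT_PORT)));
	/* Another client's DISCOVER/REQUEST broadcast: 68 -> 67, rejected. */
	assert(!ports_match(htons(CLIENT_PORT), htons(SERVER_PORT)));
}
```

Under those assumptions, a broadcast from a neighbouring client is
rejected by the filter, which corresponds to the path that ends in
"return 0" above.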

Now, looking at the callers:
bootp_proto.c: bootp_recv_reply() & dhcp_proto.c: dhcp_recv()
|
|         ret = packet_recv(iov, 3);
|         if (ret <= 0)
|                 return ret;
|
Again, 0 is returned.

dhcp_recv() is in turn wrapped by dhcp_recv_offer() & dhcp_recv_ack().

Finally, all those are used in switch() statements in main.c, where -1
and strictly positive values are checked for, but not 0. Hence the
attached patch: 0001-trivial.patch.
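
The idea can be sketched as a tiny state-transition function. This is
only an illustration under stated assumptions, not the content of the
attached patch: the state names (ST_*) and next_state() are
hypothetical and do not match main.c, and "restart the handshake" is
modelled as a single ST_RESTART state.

```c
#include <assert.h>

/* Hypothetical states for this sketch only (not ipconfig's DEVST_* set). */
enum state { ST_WAIT_REPLY, ST_COMPLETE, ST_RESTART, ST_ERROR };

/*
 * Map the result of a receive call to the next state:
 *   -1 : receive error
 *    0 : a packet arrived but was discarded (e.g. port mismatch)
 *   >0 : a usable reply arrived
 * The point of the sketch is the explicit "case 0" branch; without it,
 * a discarded packet would leave the state unchanged.
 */
static enum state next_state(enum state cur, int recv_ret)
{
	assert(cur == ST_WAIT_REPLY);   /* the only state modelled here */

	if (recv_ret > 0)
		return ST_COMPLETE;     /* usable reply */

	switch (recv_ret) {
	case -1:
		return ST_ERROR;        /* hard receive error */
	case 0:
		return ST_RESTART;      /* discarded packet: start over */
	default:
		return cur;             /* other negative values: unchanged */
	}
}
```

With that branch in place, next_state(ST_WAIT_REPLY, 0) yields
ST_RESTART instead of silently staying in ST_WAIT_REPLY.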

I guess one could consider it a special case that might deserve a
DEVST_SOFTERROR state, with a shorter retry delay than DEVST_ERROR.
That matters especially for setups with a hundred machines or more:
it'd be quite a PITA to wait 10 seconds for a retry when only a couple
of machines will complete the DHCP handshake. Not to mention the
default timeout that'll bite. That's why I'm proposing a second patch:
0002-introduce-softerror.patch. Since it's probably overkill to
introduce that additional state, I think the functionally equivalent
0003-cleaner.patch will be better if you want to implement my suggestion.
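
A rough sketch of the "shorter delay for soft errors" idea follows. It
is an illustration only and does not reproduce either patch: the names
(err_kind, retry_delay_secs(), next_attempt()) are hypothetical, the
10-second value is the retry delay mentioned above, and the 1-second
soft delay is an assumed number picked for the example.

```c
#include <assert.h>

/* Hypothetical error classes for this sketch. */
enum err_kind { ERR_HARD, ERR_SOFT };

enum {
	HARD_RETRY_SECS = 10,  /* retry delay mentioned in the text */
	SOFT_RETRY_SECS = 1    /* assumed value, for illustration only */
};

/* Delay before the next attempt, depending on the kind of error. */
static int retry_delay_secs(enum err_kind kind)
{
	assert(kind == ERR_HARD || kind == ERR_SOFT);
	return kind == ERR_SOFT ? SOFT_RETRY_SECS : HARD_RETRY_SECS;
}

/* Time (in seconds) at which the handshake should be retried. */
static long next_attempt(long now, enum err_kind kind)
{
	return now + retry_delay_secs(kind);
}
```

Whether the distinction lives in a separate state or in a delay chosen
at the point where the error is detected is a design choice; the sketch
only shows the second, simpler form.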

Patches against master branch, tested on Debian's sid version (1.5.12).

Errm, now that I'm rebooting in a loop, it looks like those patches
don't cure the problem completely, so I guess I'm back to debugging.
Hopefully, upstream will figure this out better than I can.

Cheers,
-- 
Cyril Brulebois
-------------- next part --------------
An embedded message was scrubbed...
From: Cyril Brulebois <cyril.brulebois at kerlabs.com>
Subject: [PATCH] klibc: ipconfig: Make sure unexpected received messages
	restart the handshake.
Date: Thu, 15 Jan 2009 21:31:38 +0100
Size: 1667
Url: http://www.zytor.com/pipermail/klibc/attachments/20090115/01e426e2/attachment.mht 
-------------- next part --------------
An embedded message was scrubbed...
From: Cyril Brulebois <cyril.brulebois at kerlabs.com>
Subject: [PATCH] klibc: ipconfig: Make sure unexpected received messages
	restart the handshake.
Date: Thu, 15 Jan 2009 21:31:38 +0100
Size: 3091
Url: http://www.zytor.com/pipermail/klibc/attachments/20090115/01e426e2/attachment-0001.mht 
-------------- next part --------------
An embedded message was scrubbed...
From: Cyril Brulebois <cyril.brulebois at kerlabs.com>
Subject: [PATCH] klibc: ipconfig: Make sure unexpected received messages
	restart the handshake.
Date: Thu, 15 Jan 2009 21:31:38 +0100
Size: 2533
Url: http://www.zytor.com/pipermail/klibc/attachments/20090115/01e426e2/attachment-0002.mht 

