Wednesday, December 31, 2014

Metadata failed: Network is unreachable


Yesterday I ran into a rather strange problem: our Debian image was not retrieving metadata and complained 'no route to host' on every attempt to connect to the metadata server, while Cirros was working just fine.

It was all the more interesting because the same Debian image works in our previous installation.

The main difference between those two installations (aside from the jump from Havana to Juno) is the network configuration.

Our old configuration is typical: an overlay network, a private router per tenant, a gateway to the external network, floating IPs.

The new (lab) configuration is more daring: each tenant owns a private external network with a provider router (inside a VLAN or VXLAN). OpenStack still provides the DHCP and metadata services, though.

Subnet looks like this:

neutron subnet-create net --allocation-pool start=xx.xx.xx.130,end=xx.xx.xx.158 xx.xx.xx.128/27 --enable-dhcp --gateway=xx.xx.xx.129 --dns-nameserver=8.8.8.8
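
As a sanity check on that layout, here is a minimal sketch using Python's standard ipaddress module. The addresses are stand-ins from the 192.0.2.0/24 documentation range (the real ones are masked above), and pool_size is a hypothetical helper, not part of any OpenStack tooling.

```python
import ipaddress


def pool_size(cidr, gateway, pool_start, pool_end):
    """Validate a subnet layout and return the number of addresses in the pool.

    Hypothetical helper for illustration only.
    """
    subnet = ipaddress.ip_network(cidr)
    gw = ipaddress.ip_address(gateway)
    start = ipaddress.ip_address(pool_start)
    end = ipaddress.ip_address(pool_end)

    # Usable host range: everything except the network and broadcast addresses.
    first_host = subnet.network_address + 1
    last_host = subnet.broadcast_address - 1

    for name, addr in (("gateway", gw), ("pool start", start), ("pool end", end)):
        if not first_host <= addr <= last_host:
            raise ValueError(f"{name} {addr} is not a usable host in {subnet}")
    if start > end:
        raise ValueError("pool start is above pool end")
    if start <= gw <= end:
        raise ValueError("gateway lies inside the allocation pool")

    return int(end) - int(start) + 1


# Stand-in values mirroring the /27 layout above: gateway .129, pool .130-.158.
print(pool_size("192.0.2.128/27", "192.0.2.129", "192.0.2.130", "192.0.2.158"))  # 29
```

With this layout the pool covers every usable host except the gateway, so the DHCP/metadata port gets its address from the same pool as the instances.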

DHCP agent is set up for isolated networks.

Because of the default settings in the Debian image (login by SSH key only) this is really hard to debug: no metadata means no key, and no key means no access to the instance. And the 'image for debugging' (Cirros) is working! What a pain in the stack.

After rebuilding a debug Debian image with a hardcoded password (never do this in production!) I was able to log in to Debian and look around.

Cirros is working:
$ curl http://169.254.169.254/2009-04-04/meta-data/placement/availability-zone
nova

But Debian is not:
$ wget 169.254.169.254/2009-04-04/meta-data/placement/availability-zone
--2014-12-31 09:49:48-- http://169.254.169.254/2009-04-04/meta-data/placement/availability-zone
Connecting to 169.254.169.254:80... failed: Network is unreachable.

The difference is easy to spot in the routing table (ip route show):

Cirros:
default via xx.xx.xx.129 dev eth0
xx.xx.xx.128/27 dev eth0 src xx.xx.xx.132
169.254.169.254 via xx.xx.xx.131 dev eth0

Debian:
default via xx.xx.xx.129 dev eth0
xx.xx.xx.128/27 dev eth0 proto kernel scope link src xx.xx.xx.13


Note the lack of the '169.254.169.254' line in Debian's version. xx.xx.xx.131 is the DHCP/metadata IP. That route is supposed to arrive over DHCP, so DHCP is to blame.
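
The same comparison can be scripted. This is a minimal sketch that looks for a specific (non-default) route covering the metadata address in 'ip route show' output. has_metadata_route is a hypothetical helper, and the sample tables use stand-in addresses from the 192.0.2.0/24 documentation range instead of the masked ones.

```python
import ipaddress

METADATA_IP = ipaddress.ip_address("169.254.169.254")


def has_metadata_route(route_output):
    """Return True if a non-default route in `ip route show` output covers the metadata IP."""
    for line in route_output.splitlines():
        fields = line.split()
        if not fields or fields[0] == "default":
            continue
        try:
            # A bare address such as "169.254.169.254" parses as a /32 network.
            network = ipaddress.ip_network(fields[0], strict=False)
        except ValueError:
            # Lines starting with a keyword (e.g. "unreachable") are skipped.
            continue
        if METADATA_IP in network:
            return True
    return False


# Stand-in routing tables shaped like the two shown above.
cirros_routes = """default via 192.0.2.129 dev eth0
192.0.2.128/27 dev eth0 src 192.0.2.132
169.254.169.254 via 192.0.2.131 dev eth0
"""
debian_routes = """default via 192.0.2.129 dev eth0
192.0.2.128/27 dev eth0 proto kernel scope link src 192.0.2.133
"""

print(has_metadata_route(cirros_routes))  # True
print(has_metadata_route(debian_routes))  # False
```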

To prove this I needed to sniff the traffic and see what actually happens.

The best way to do this is to enter the namespace of the DHCP agent and use tshark to look at the DHCP exchange.

Tshark is the console version of Wireshark, and its output is much more readable than tcpdump's.

neutron net-list:

+--------------------------------------+---------+------------------------------------------------------+
| id                                   | name    | subnets                                              |
+--------------------------------------+---------+------------------------------------------------------+
| 40a7b4bf-8714-42c0-bed2-8720cd650958 | network | d4fbd66a-dbaa-436e-90ed-0260a5f7d265 88.85.77.128/27 |
+--------------------------------------+---------+------------------------------------------------------+

On the network node, go to the DHCP agent namespace:

stdbuf -e0 -o0 ip netns exec qdhcp-40a7b4bf-8714-42c0-bed2-8720cd650958 /bin/bash

(stdbuf is used to disable buffering, otherwise the text on screen would appear with a delay)

Run 'ip l' to see the interfaces, and then:

tshark -Vi tap3acc3114-45 -Pf 'udp'

Here are snippets from the packets I saw:

Cirros:

request:

Parameter Request List Item: (1) Subnet Mask
Parameter Request List Item: (3) Router
Parameter Request List Item: (6) Domain Name Server
Parameter Request List Item: (12) Host Name
Parameter Request List Item: (15) Domain Name
Parameter Request List Item: (26) Interface MTU
Parameter Request List Item: (28) Broadcast Address
Parameter Request List Item: (42) Network Time Protocol Servers
Parameter Request List Item: (121) Classless Static Route

reply:

(skip)
Option: (3) Router
Length: 4
Router: xx.xx.xx.129 (xx.xx.xx.129)
Option: (121) Classless Static Route
Length: 14
Subnet/MaskWidth-Router: 169.254.169.254/32-xx.xx.xx.131
Subnet/MaskWidth-Router: default-xx.xx.xx.129
(skip)
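
To make the 'Length: 14' in that reply less magical, here is a minimal sketch of decoding the option 121 payload as laid out in RFC 3442: each route is one prefix-width byte, then only the significant bytes of the destination, then four bytes of router. decode_classless_routes is a hypothetical helper, and the router addresses are stand-ins from the 192.0.2.0/24 documentation range.

```python
import ipaddress
from typing import List, Tuple


def decode_classless_routes(data: bytes) -> List[Tuple[str, str]]:
    """Decode a DHCP option 121 payload into (destination, router) pairs."""
    routes = []
    i = 0
    while i < len(data):
        width = data[i]
        if width > 32:
            raise ValueError(f"invalid prefix width {width}")
        i += 1
        # Only the significant octets of the destination are transmitted.
        significant = (width + 7) // 8
        if i + significant + 4 > len(data):
            raise ValueError("truncated option 121 payload")
        dest_bytes = data[i:i + significant] + bytes(4 - significant)
        i += significant
        router_bytes = data[i:i + 4]
        i += 4
        dest = ipaddress.IPv4Network((int.from_bytes(dest_bytes, "big"), width), strict=False)
        router = ipaddress.IPv4Address(router_bytes)
        routes.append((str(dest), str(router)))
    return routes


# A /32 host route costs 1 + 4 + 4 = 9 bytes, a default route 1 + 0 + 4 = 5 bytes: 14 in total.
payload = bytes([32, 169, 254, 169, 254, 192, 0, 2, 131,
                 0, 192, 0, 2, 129])
print(len(payload))  # 14
print(decode_classless_routes(payload))
# [('169.254.169.254/32', '192.0.2.131'), ('0.0.0.0/0', '192.0.2.129')]
```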

From Debian:

request:
Parameter Request List Item: (1) Subnet Mask
Parameter Request List Item: (3) Router
Parameter Request List Item: (6) Domain Name Server
Parameter Request List Item: (15) Domain Name
Parameter Request List Item: (28) Broadcast Address
Parameter Request List Item: (12) Host Name
Parameter Request List Item: (7) Log Server
Parameter Request List Item: (9) LPR Server
Parameter Request List Item: (42) Network Time Protocol Servers
Parameter Request List Item: (48) X Window System Font Server
Parameter Request List Item: (49) X Window System Display Manager
Parameter Request List Item: (26) Interface MTU

With an obvious reply:

(skip)
Option: (3) Router
Length: 4
Router: 88.85.77.129 (xx.xx.77.129)
(skip)

Note the lack of '(121) Classless Static Route' in Debian's request. So the problem is in the requests themselves: Debian never asks for the option, and the reply does not contain it.
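
The two parameter request lists are easy to diff by option code. A minimal sketch, with the codes copied from the captures above; missing_options is a hypothetical helper.

```python
# Option codes from the two Parameter Request Lists captured above.
CIRROS_REQUEST = [1, 3, 6, 12, 15, 26, 28, 42, 121]
DEBIAN_REQUEST = [1, 3, 6, 15, 28, 12, 7, 9, 42, 48, 49, 26]

CLASSLESS_STATIC_ROUTE = 121


def missing_options(reference, candidate):
    """Return the option codes requested in `reference` but not in `candidate`."""
    return sorted(set(reference) - set(candidate))


print(missing_options(CIRROS_REQUEST, DEBIAN_REQUEST))  # [121]
print(missing_options(DEBIAN_REQUEST, CIRROS_REQUEST))  # [7, 9, 48, 49]
print(CLASSLESS_STATIC_ROUTE in DEBIAN_REQUEST)         # False
```

The only option Cirros asks for that Debian does not is 121. The extra codes on the Debian side (log server, LPR server, X font server, X display manager) are irrelevant to routing.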

After messing around a bit and talking with the guy who manages images, I found that we're using pump (https://packages.debian.org/wheezy/pump) as the DHCP client in Debian. It ships without a default config, and out of the box it does not ask for additional routes.

So the problem was in the use of pump 'as is', without a config. It was working in our previous configuration by pure luck: there the metadata was served from the router namespace, so no additional routes were needed at all.

Pump is guilty (and we are not, heh).

Conclusion:

We're going to either switch back to a patched (hello, shellshock) isc-dhcp-client, or try to fix the pump config.

Option 121 (Classless Static Route) is very important for OpenStack instances.
