Thursday, December 24, 2015

How to randomly distribute networks to all dhcp-agents

If you have an issue with DHCP agents, one of the solutions is to recreate the DHCP ports. Here is a small howto:

Step 1: remove all networks from all DHCP agents:

 for a in `neutron agent-list|grep dhcp|awk '{print $2}'`;do neutron net-list|awk '{print $2}'|xargs -I I neutron dhcp-agent-network-remove $a I ;done

Step 2: Throw the networks randomly at the DHCP agents:


neutron agent-list|grep DHCP|grep True|grep ':-)'|awk '{print $2}' > /tmp/dhcp_agents
for net in `neutron net-list|awk '{print $2}'`;do neutron dhcp-agent-network-add `sort -R /tmp/dhcp_agents|tail -1` $net ;done

Note: this works only in a homogeneous setup (all DHCP agents are equal).

You could make it a one-liner if you want, but calling neutron agent-list in a loop is not a good idea.
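To verify the spread afterwards, you can ask neutron which agents host each network. A rough sketch (untested; the awk filter just skips the table header and borders):

for net in `neutron net-list | awk '/\|/ && !/ id / {print $2}'`; do
    echo -n "$net: "
    neutron dhcp-agent-list-hosting-net $net | grep -c ':-)'
done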

Wednesday, December 16, 2015

Converting hg repository to git

(Off-topic for OpenStack, but an important task for DevOps work around CI.)

Our CI needs git to build proper Debian packages. We do not use Mercurial in any way and all our internal code lives in git repositories. But some external software we use is published in Mercurial and does not provide pre-built packages.

We decided to convert the external hg repository to a local git one, and then proceed as usual.

Here is a simple way to convert an hg repository to git (a combined script sketch follows the steps):
1) Install mercurial and hg-fast-export:
    sudo apt-get install -y mercurial hg-fast-export
2) Clone the hg repository to some temporary place (e.g. /tmp):
    hg clone https://bitbucket.org/some/repo
3) Create a new git repository in your GitLab/GitHub
4) Create a local empty git repository:
    mkdir ~/git/some_repo
    cd ~/git/some_repo
    git init
5) Run hg-fast-export:
    cd ~/git/some_repo
    hg-fast-export -r /tmp/some/repo
6) Add remote origin to git:
    git remote add origin git@internal_git:some_repo.git
7) Push changes:
    git push --all
8) Push tags:
    git push --tags
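For reference, here is the whole flow squashed into one rough script (the URL and paths are the placeholders from the steps above, adjust to your setup):

    #!/bin/bash
    # rough sketch of the hg -> git conversion described above
    set -e
    HG_URL=https://bitbucket.org/some/repo           # placeholder
    GIT_REMOTE=git@internal_git:some_repo.git        # placeholder

    sudo apt-get install -y mercurial hg-fast-export
    hg clone "$HG_URL" /tmp/some_repo
    mkdir -p ~/git/some_repo
    cd ~/git/some_repo
    git init
    hg-fast-export -r /tmp/some_repo
    git remote add origin "$GIT_REMOTE"
    git push --all
    git push --tags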

That's all. No more strange thoughts about 'How to force intercourse between debian-jenkins-glue & mercurial'.

Tuesday, May 12, 2015

Ubuntu VS upstream versions of OVS

Some bugs amuse me by breaking the system even though everything is done right and follows best practices.

Here is the story of one of such bugs.

Given: an OpenStack installation on Ubuntu. Open vSwitch is used as the software switch on the compute nodes. We're using the most recent Open vSwitch: we build it from source with our CI and deploy it from our mirror via apt.

Problem: Bridges in /etc/network/interfaces are not handled properly on a few nodes. Most of the nodes work fine, but a few do not.

Here is an example of /etc/network/interfaces:

allow-br_int em0
iface em0 inet manual
        ovs_type OVSPort
        ovs_bridge br_int

allow-ovs br_int
iface br_int inet manual
        ovs_type OVSBridge
        ovs_ports em0


(Yes, that type of syntax is a bit odd, but it is supported by OVS according to the documentation)

Those options should be handled by the openvswitch-switch init script (/etc/init.d/openvswitch-switch), and, indeed, it does:

network_interfaces () {
    INTERFACES="/etc/network/interfaces"
    [ -e "${INTERFACES}" ] || return
    bridges=`awk '{ if ($1 == "allow-ovs") { print $2; } }' "${INTERFACES}"`
    [ -n "${bridges}" ] && $1 --allow=ovs ${bridges}
}


But the 'allow-ovs' settings are ignored on some nodes and respected on others.

The nodes are identical, with identical software and hardware configuration. They should never behave differently unless a hardware failure happens.

I decided to trace what the scripts were doing on the broken nodes. I checked every step of /etc/init/openvswitch-switch.conf... and found that it contains not a single line of code to parse /etc/network/interfaces. Looks like a broken init script...


I dug into the build scripts to see what happens during a build, and one thing stood out: I could not find anything related to an upstart config in the debian/ directory of the source package.

But /etc/init/openvswitch-switch.conf was provided by the OVS package (query on a broken node):

dpkg -S /etc/init/openvswitch-switch.conf
openvswitch-switch: /etc/init/openvswitch-switch.conf


This sounded crazy. I checked the post-/pre-install scripts: nothing there.

Then an idea struck me: the file is not in the package we're installing. It is a leftover of a mistakenly installed Ubuntu version.

Yes, on the 'broken' nodes OVS was first installed from the Ubuntu mirror and later upgraded to our version. Ubuntu's OVS ships /etc/init/openvswitch-switch.conf, and the upstream version does not.

But all init scripts are considered configuration files. They are not removed by the 'remove' operation (only purge wipes them), and they are kept intact during upgrades.
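By the way, you can see which files a package has registered as conffiles with dpkg-query, which makes this kind of leftover easy to spot:

    # list the conffiles registered by the installed openvswitch-switch package
    dpkg-query -W -f='${Conffiles}\n' openvswitch-switch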

And upstart prefers its own config over sysv-init (/etc/init.d). So on the broken nodes upstart used the upstart config, and on the working nodes it used the sysv-init script.

The fix was obvious: remove every upstart job related to OVS from every node.
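Something along these lines (a sketch only; /tmp/nodes is a hypothetical file with one hostname per line):

    # drop the stale upstart job on every node and restart OVS so it
    # falls back to the sysv-init script
    for h in `cat /tmp/nodes`; do
        ssh $h 'rm -f /etc/init/openvswitch-switch.conf && service openvswitch-switch restart'
    done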

But who is to blame?

  1. The Ubuntu OVS maintainer adds an upstart job to the package and marks it as a config file. Check.
  2. The OVS developers know nothing about the upstart job and provide a sysv-init script as a config file. During an upgrade it does not interfere with the upstart job in any way. Check.
  3. According to Debian Policy, config files are preserved during a package upgrade. The disappearance of a config file between package versions is not a reason to remove it during the upgrade. Check.
  4. The sysadmin assumes that upgrading a package replaces its files with newer versions and warns if there is a conflict between versions of a config file. Check.
  5. Upstart prefers its own config over a legacy sysv-init script when starting a service. Documented and reasonable. Check.

Everyone does everything properly, but the system ends up broken as a result.

Sigh...

Tuesday, April 7, 2015

Btrfs is still not very production-ready

We have a few servers with not-that-important virtual machines on btrfs. We decided to use it to squeeze maximum efficiency out of our not-that-robust hard drives of different sizes in a JBOD setup.

We do not care about losing some of those VMs, but we don't want to lose them all at the same time. So the idea was to use btrfs over LVM on those drives, so that if one drive failed, btrfs would keep operating. In my initial testing btrfs was the only filesystem that survived and kept operating on a volume with missing parts in the middle (of course it would be a disaster: read-only mode and garbage instead of the disks of the affected VMs, but at least some of them would survive to be migrated to other hosts).

Anyway, the host caught an OOM (yeah, save on everything!), and a bad one at that (it wiped out OVS, neutron, libvirt and ssh). After a reboot it got stuck at boot while mounting, with these errors:

btrfs: device fsid 0520e52d-7681-4156-9061-388e374c4e16 devid 1 transid 407769 /dev/mapper/host-nova
parent transid verify failed on 471304036352 wanted 407770 found 407769
parent transid verify failed on 471304036352 wanted 407770 found 407769
btrfs: failed to read log tree
btrfs: open_ctree failed

btrfsck complains:
parent transid verify failed on 471304036352 wanted 407770 found 407769
parent transid verify failed on 471304036352 wanted 407770 found 407769
parent transid verify failed on 471304036352 wanted 407770 found 407769
btrfsck: disk-io.c:439: find_and_setup_log_root: Assertion `!(!log_root->node)' failed.
Aborted

NNNice!

But I quickly found an article about this message, and it mentions a 'new version of the userspace tools'. We were running Ubuntu 12.04, so I upgraded to the trusty btrfs-tools package. The new version of btrfsck can do more, but during execution it consumed 18 GB of memory. Just for a 2 TB volume. Hard to imagine how that would look on a machine with 4-8 GB...


btrfsck --repair /dev/mapper/host-nova
enabling repair mode
parent transid verify failed on 471304036352 wanted 407770 found 407769
parent transid verify failed on 471304036352 wanted 407770 found 407769
parent transid verify failed on 471304036352 wanted 407770 found 407769
parent transid verify failed on 471304036352 wanted 407770 found 407769
Ignoring transid failure
Checking filesystem on /dev/mapper/host-nova
UUID: 0520e52d-7681-4156-9061-388e374c4e16
checking extents
checking free space cache
cache and super generation don't match, space cache will be invalidated
checking fs roots
root 5 inode 407 errors 80, file extent overlap
found 214360836421 bytes used err is 1
total csum bytes: 0
total tree bytes: 10665472000
total fs tree bytes: 5452877824
total extent tree bytes: 5212286976
btree space waste bytes: 843886520
file data blocks allocated: 892593057792
 referenced 890681024512
Btrfs v3.12

More work done, but no help: the volume still does not mount, with the same complaint.


As a last resort I tried btrfs-zero-log, as the article suggests. And it helped! The volume mounted and has been operating normally since then.
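For the record, the invocation is trivial; the filesystem must be unmounted (newer btrfs-progs ship the same tool as 'btrfs rescue zero-log'):

    # wipe the corrupted log tree (drops the last pending transactions)
    btrfs-zero-log /dev/mapper/host-nova
    # on newer btrfs-progs:
    # btrfs rescue zero-log /dev/mapper/host-nova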

But I'm still worried about that behavior. Why was the log corrupted? Can a mere OOM cause that much damage? And why is the error reporting so cryptic?

Meanwhile I was able to save an image (btrfs-image, just 800 MB of metadata for a 2 TB filesystem) and I will report this to the kernel bug tracker.
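The metadata dump was taken roughly like this (-c is the compression level, -t the number of threads):

    # dump filesystem metadata only (no file data) into a compressed image
    btrfs-image -c 9 -t 4 /dev/mapper/host-nova /tmp/host-nova.img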

Wednesday, March 11, 2015

How to filter IP addresses inside GRE in tcpdump

If you have ever debugged a network node with GRE, you know that painful feeling from tcpdump output: thousands of IPs, and you cannot tell them apart because they are inside GRE.

And there are no filters for IP addresses inside GRE. You cannot say 'tcpdump -ni eth0 proto gre and host 192.168.0.1'. Well, you can, but 'host' will only filter on the source or destination of the GRE packets themselves, not the encapsulated IP packet.

Unfortunately there is no nice syntax. Fortunately, there is some, at least.

You'll need to convert the IP address to a network-byte-order integer: every octet is converted to two hex digits and joined together 'as is'. 100.64.6.7 becomes 0x64400607.

In Python there is the ipaddress module, but it's not available in the default installation, so we'll do it manually with minimal code:

>>> "0x%02x%02x%02x%02x" % tuple(map(int, '192.168.0.1'.split('.')))
'0xc0a80001'

(sorry for mad code, but I wanted to keep it short).

The result of that code is a 'number' representing the IP address (in the example above, 192.168.0.1).
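If you prefer to stay in the shell, a printf one-liner does the same job:

$ printf '0x%02x%02x%02x%02x\n' `echo 192.168.0.1 | tr . ' '`
0xc0a80001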

Now we can run tcpdump:

tcpdump -ni eth1 'proto gre and (ip[54:4]=0xc0a80001 or ip[58:4]=0xc0a80001)'

The numbers in the square brackets after 'ip' are the offset and the size of the field; an IPv4 address is 4 bytes long. Because GRE adds 42 bytes of overhead (20 bytes of the outer IP header, 8 bytes of the GRE header, 14 bytes of the encapsulated Ethernet header), we take the normal IP source/destination offset (12 and 16 within the IP header) and add 42 to it.

ip[54:4] is the source IP
ip[58:4] is the destination IP
ip[51:1] is the protocol type (ICMP, TCP, UDP, etc.).
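For example, using the offsets above, this should show only ICMP inside the tunnel to or from 192.168.0.1 (an untested sketch, same filter style as before):

tcpdump -ni eth1 'proto gre and ip[51:1]=1 and (ip[54:4]=0xc0a80001 or ip[58:4]=0xc0a80001)'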

ARP stuff is:

ip[56:4] (sender) and ip[66:4] (target) IP addresses.
ip[50:6] and ip[60:6] are the sender/target MAC addresses (but you will need to calculate a proper hex number for the 6-byte MAC - I didn't test this).

Tuesday, January 27, 2015

All possible hardware options for nova

The OpenStack documentation mentions hardware options for images: http://docs.openstack.org/user-guide/content/cli_manage_images.html

But it gives only a few examples of such options, without an explicit list. After some googling in Bing and DuckDuckGo I gave up and decided to create my own.

A simple one-liner to find all strings starting with 'hw_' in the Nova repo:
 

$ egrep -rho "['\"]hw_[a-zA-Z_0-9]+['\"]" . | sed "s/['\"]//g" | sort -u

Here are results for Juno:
  • hw_cdrom_bus
  • hw_cpu_cores
  • hw_cpu_max_cores
  • hw_cpu_max_sockets
  • hw_cpu_max_threads
  • hw_cpu_sockets
  • hw_cpu_threads
  • hw_disk_bus
  • hw_disk_discard
  • hw_machine_type
  • hw_numa_nodes
  • hw_qemu_guest_agent
  • hw_rng_model
  • hw_scsi_model
  • hw_serial_port_count
  • hw_veb
  • hw_video_model
  • hw_video_ram
  • hw_vif_model
  • hw_watchdog_action
Not very impressive list, is it?
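Still, some of them are quite useful. Setting one on a Glance image looks roughly like this (the image id is a placeholder; hw_disk_bus=scsi together with hw_scsi_model=virtio-scsi switches the instance disks to virtio-scsi):

glance image-update --property hw_disk_bus=scsi --property hw_scsi_model=virtio-scsi <image-id>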

Wednesday, December 31, 2014

Metadata failed: Network is unreachable


Yesterday I ran into a rather strange problem: our Debian image was not retrieving metadata, complaining 'Network is unreachable' on every attempt to connect to the metadata server, while Cirros was working just fine.

It got even more interesting because the same Debian image works in our previous installation.

The main difference between the two installations (aside from the jump from Havana to Juno) is the network configuration.

Our old configuration is typical: overlay network, private router per tenant, gateway to external network, floating IP.

The new (lab) configuration is more daring: each tenant owns a private external network with a provider router (inside a VLAN or VXLAN). But OpenStack still provides the DHCP and metadata services.

Subnet looks like this:

neutron subnet-create net --allocation-pool start=xx.xx.xx.130,end=xx.xx.xx.158 xx.xx.xx.128/27 --enable-dhcp --gateway=xx.xx.xx.129 --dns-nameserver=8.8.8.8

The DHCP agent is set up to serve metadata for isolated networks.
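In practice that means one option in the DHCP agent config; a quick way to check it (assuming the stock config path):

# metadata on isolated networks is served by the DHCP agent only when this is enabled
grep enable_isolated_metadata /etc/neutron/dhcp_agent.ini
# expected: enable_isolated_metadata = True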

Because of the default settings in the Debian image (ssh key authentication only) it's really hard to debug: no metadata, no key, no access to the instance. And the 'image for debugging' (Cirros) works! What a pain in the stack.

After rebuilding the debug Debian image with hardcoded password (never do this in production!) I was able to log in to Debian and look around.

Cirros is working:
$ curl http://169.254.169.254/2009-04-04/meta-data/placement/availability-zone
nova

But Debian does not:
$ wget 169.254.169.254/2009-04-04/meta-data/placement/availability-zone
--2014-12-31 09:49:48-- http://169.254.169.254/2009-04-04/meta-data/placement/availability-zone
Connecting to 169.254.169.254:80... failed: Network is unreachable.

The difference is easy to spot: routing (ip route show)

Cirros:
default via xx.xx.xx.129 dev eth0
xx.xx.xx.128/27 dev eth0 src xx.xx.xx.132
169.254.169.254 via xx.xx.xx.131 dev eth0

Debian:
default via xx.xx.xx.129 dev eth0
xx.xx.xx.128/27 dev eth0 proto kernel scope link src xx.xx.xx.13


Note the lack of the '169.254.169.254' line in Debian's version. xx.xx.xx.131 is the DHCP/metadata IP. DHCP is to blame.

To prove this I need to snoop traffic to see what happens.

The best way to do this is to go into the namespace of the DHCP agent and use tshark to look at the DHCP exchange.

Tshark is a console version of wireshark and it provides much more readable output than tcpdump.

neutron net-list:

+--------------------------------------+---------+------------------------------------------------------+
| id                                   | name    | subnets                                              |
+--------------------------------------+---------+------------------------------------------------------+
| 40a7b4bf-8714-42c0-bed2-8720cd650958 | network | d4fbd66a-dbaa-436e-90ed-0260a5f7d265 88.85.77.128/27 |
+--------------------------------------+---------+------------------------------------------------------+

On the network node, go to the DHCP agent namespace:

stdbuf -e0 -o0 ip net exec qdhcp-40a7b4bf-8714-42c0-bed2-8720cd650958 /bin/bash

(stdbuf is used to disable buffering, otherwise the text on screen would appear with a delay)

Use 'ip l' to see the interfaces, and then:

tshark -Vi tap3acc3114-45 -Pf 'udp'

Here are snippets from the packets I saw:

Cirros:

request:

Parameter Request List Item: (1) Subnet Mask
Parameter Request List Item: (3) Router
Parameter Request List Item: (6) Domain Name Server
Parameter Request List Item: (12) Host Name
Parameter Request List Item: (15) Domain Name
Parameter Request List Item: (26) Interface MTU
Parameter Request List Item: (28) Broadcast Address
Parameter Request List Item: (42) Network Time Protocol Servers
Parameter Request List Item: (121) Classless Static Route

reply:

(skip)
Option: (3) Router
Length: 4
Router: xx.xx.xx.129 (xx.xx.xx.129)
Option: (121) Classless Static Route
Length: 14
Subnet/MaskWidth-Router: 169.254.169.254/32-xx.xx.xx.131
Subnet/MaskWidth-Router: default-xx.xx.xx.129
(skip)

From Debian:

request:
Parameter Request List Item: (1) Subnet Mask
Parameter Request List Item: (3) Router
Parameter Request List Item: (6) Domain Name Server
Parameter Request List Item: (15) Domain Name
Parameter Request List Item: (28) Broadcast Address
Parameter Request List Item: (12) Host Name
Parameter Request List Item: (7) Log Server
Parameter Request List Item: (9) LPR Server
Parameter Request List Item: (42) Network Time Protocol Servers
Parameter Request List Item: (48) X Window System Font Server
Parameter Request List Item: (49) X Window System Display Manager
Parameter Request List Item: (26) Interface MTU

With an obvious reply:

(skip)
Option: (3) Router
Length: 4
Router: 88.85.77.129 (xx.xx.77.129)
(skip)

Note the lack of '(121) Classless Static Route' in Debian's request. So the problem lies in the different requests.

After messing around a bit and talking with the guy who manages the images, I found that we're using pump (https://packages.debian.org/wheezy/pump) as the DHCP client in Debian. And it ships without a default config and does not ask for additional routes.

So the problem was in using pump 'as is', without a config. It worked in our previous configuration by pure luck, because the metadata was served from the router namespace and there were no additional routes at all.

Pump is guilty (and we are not, heh).

Conclusion:

We're going to either switch back to the patched (hello, shellshock) isc-dhcp-client, or try to fix the pump config.

Option 121 (Classless Static Route) is very important for OpenStack instances.
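If we go the isc-dhcp-client way, the fix is simply making sure dhclient defines and requests option 121; the stock Debian /etc/dhcp/dhclient.conf does it roughly like this (trimmed excerpt):

# define option 121 and include it in the request list
option rfc3442-classless-static-routes code 121 = array of unsigned integer 8;
request subnet-mask, broadcast-address, routers, domain-name,
        domain-name-servers, host-name, ntp-servers,
        rfc3442-classless-static-routes;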