Tuesday, May 12, 2015

Ubuntu vs. upstream versions of OVS

Some bugs amuse me: they break the system even though everything was done right and followed best practices.

Here is the story of one such bug.

Given: an OpenStack installation on Ubuntu. Open vSwitch (OVS) is used as the software switch on the compute nodes. We run the most recent OVS: we build it from source in our CI and deploy it from our own mirror via apt.

Problem: bridges defined in /etc/network/interfaces are not handled properly on a few nodes. Most of the nodes work fine.

Here is an example of /etc/network/interfaces:

allow-br_int em0
iface em0 inet manual
        ovs_type OVSPort
        ovs_bridge br_int

allow-ovs br_int
iface br_int inet manual
        ovs_type OVSBridge
        ovs_ports em0


(Yes, this syntax is a bit odd, but according to the documentation OVS supports it.)

These stanzas should be handled by the openvswitch-switch init script (/etc/init.d/openvswitch-switch), and indeed it does handle them:

network_interfaces () {
    INTERFACES="/etc/network/interfaces"
    [ -e "${INTERFACES}" ] || return
    bridges=`awk '{ if ($1 == "allow-ovs") { print $2; } }' "${INTERFACES}"`
    [ -n "${bridges}" ] && $1 --allow=ovs ${bridges}
}
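
To see what that function would pick up on a given node, the same awk filter can be run by hand. A minimal sketch: the list_ovs_bridges name and the throwaway sample file are mine, not part of the OVS init script.

```shell
# Hypothetical helper: prints the bridge names that the awk filter in
# network_interfaces() would extract. Takes the file as an optional argument.
list_ovs_bridges () {
    awk '{ if ($1 == "allow-ovs") { print $2; } }' "${1:-/etc/network/interfaces}"
}

# Try it on a throwaway sample instead of the real config
sample=$(mktemp)
cat > "$sample" <<'EOF'
allow-br_int em0

allow-ovs br_int
iface br_int inet manual
        ovs_type OVSBridge
        ovs_ports em0
EOF
list_ovs_bridges "$sample"    # prints: br_int
rm -f "$sample"
```

Only lines whose first word is exactly "allow-ovs" match, so the "allow-br_int em0" line is skipped. The init script then hands the resulting list to whatever command it received as $1, together with --allow=ovs.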


But on some nodes the 'allow-ovs' settings are ignored, while on others they are respected.

The nodes are identical, with the same hardware and software configuration. They should not behave differently unless hardware fails.

I decided to trace what the scripts were doing on the broken nodes. I checked every step of /etc/init/openvswitch-switch.conf and found that it contains not a single line of code that parses /etc/network/interfaces. Looks like a broken init script...


I dug into the build scripts to see what happens during a build. One thing stood out: I could not find anything related to an upstart config in the debian/ directory of the source package.

But on the broken nodes dpkg reported that /etc/init/openvswitch-switch.conf belonged to the OVS package:

dpkg -S /etc/init/openvswitch-switch.conf
openvswitch-switch: /etc/init/openvswitch-switch.conf


This sounded crazy. I checked the pre- and post-install scripts: nothing there.

Then an idea struck me: the file is not in the package we install. It was a trace of a mistakenly installed Ubuntu version.

Yes, on the 'broken' nodes OVS had first been installed from the Ubuntu mirror and later upgraded to our version. Ubuntu's OVS package ships /etc/init/openvswitch-switch.conf; the upstream version does not.

But all init scripts are treated as configuration files. They are not deleted by a 'remove' operation (only 'purge' wipes them), and they are kept intact during an upgrade.
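
One way to spot such leftovers, as a sketch rather than something I ran back then: dpkg records a package's conffiles in its status database, and as far as I know it flags a conffile that the current package version no longer ships as "obsolete". The obsolete_conffiles helper name is hypothetical.

```shell
# Hypothetical helper: reads the Conffiles field of a package on stdin
# (one "path checksum [flag]" entry per line) and prints the paths whose
# last field is the "obsolete" flag.
obsolete_conffiles () {
    awk '$NF == "obsolete" { print $1 }'
}

# Usage on a real node (needs dpkg, so it is left commented out here):
#   dpkg-query -W -f='${Conffiles}\n' openvswitch-switch | obsolete_conffiles
```

If the assumption holds, the stale upstart job shows up in that output on the broken nodes and nothing shows up on the working ones.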

And upstart prefers its own job config over a sysv-init script (/etc/init.d). So on the broken nodes upstart used the leftover upstart job, and on the working nodes it fell back to the sysv-init script.
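
A quick way to find services in this situation is to look for names that exist both as an upstart job and as a real sysv-init script. A minimal sketch: the shadowed_services name is hypothetical, and skipping symlinks assumes that some /etc/init.d entries on the release in question are just links to an upstart wrapper.

```shell
# Hypothetical helper: lists service names that have both an upstart job
# (<dir>/<name>.conf) and a regular sysv-init script of the same name.
# Directories are parameters so it can be tried on a copy of /etc.
shadowed_services () {
    upstart_dir="${1:-/etc/init}"
    sysv_dir="${2:-/etc/init.d}"
    for job in "$upstart_dir"/*.conf; do
        [ -e "$job" ] || continue              # glob matched nothing
        name=$(basename "$job" .conf)
        # Assumption: symlinked init.d entries are upstart wrappers, skip them
        [ -e "$sysv_dir/$name" ] && [ ! -L "$sysv_dir/$name" ] && echo "$name"
    done
    return 0
}

# Usage on a node:
#   shadowed_services
```

Any name it prints is a service where the upstart job wins and the sysv-init script is silently ignored, which is worth a second look if the two files come from different package versions.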

The fix was obvious: remove every OVS-related upstart job from every node.

But who is to blame?

  1. The Ubuntu OVS maintainer adds an upstart job to the package and marks it as a config file. Check.
  2. The OVS developers have no idea about the upstart job and provide a sysv-init script as a config file. During an upgrade it does not interfere with the upstart job in any way. Check.
  3. According to Debian Policy, config files are preserved during a package upgrade. A config file disappearing between package versions is not a reason to remove it during the upgrade. Check.
  4. The sysadmin assumes that upgrading a package replaces its files with newer versions and gives a warning if there is a conflict between versions of the config files. Check.
  5. Upstart prefers its own config over a legacy sysv-init script when starting a service. Documented and reasoned. Check.

Everyone does everything properly, but the system ends up broken as a result.

Sigh...
