git.openstack.org adventures

Over the past few months I started to notice occasional issues when cloning repositories (particularly nova) from git.openstack.org.

It would fail with something like:

git clone -vvv git://git.openstack.org/openstack/nova .
fatal: The remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed

The problem would occur sporadically during our 3rd party CI runs, causing them to fail. Initially these failures went largely ignored, as a recheck on the job would succeed and the world would be shiny again. However, as they became more frequent the issue needed to be addressed.

When a patch merges in Gerrit it is replicated out to 5 different cgit backends (git0[1-5].openstack.org). These are then balanced by two HAProxy frontends, which sit behind a simple DNS round-robin.

                          +-------------------+
                          | git.openstack.org |
                          |    (DNS Lookup)   |
                          +--+-------------+--+
                             |             |
                    +--------+             +--------+
                    |           A records           |
+-------------------v----+                    +-----v------------------+
| git-fe01.openstack.org |                    | git-fe02.openstack.org |
|   (HAProxy frontend)   |                    |   (HAProxy frontend)   |
+-----------+------------+                    +------------+-----------+
            |                                              |
            +-----+                                    +---+
                  |                                    |
            +-----v------------------------------------v-----+
            |    +---------------------+  (source algorithm) |
            |    | git01.openstack.org |                     |
            |    |   +---------------------+                 |
            |    +---| git02.openstack.org |                 |
            |        |   +---------------------+             |
            |        +---| git03.openstack.org |             |
            |            |   +---------------------+         |
            |            +---| git04.openstack.org |         |
            |                |   +---------------------+     |
            |                +---| git05.openstack.org |     |
            |                    |  (HAProxy backend)  |     |
            |                    +---------------------+     |
            +------------------------------------------------+
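As a quick sanity check, the round-robin is visible from any client simply by resolving the name:

    # should return the A records for both frontends, in varying order
    dig +short git.openstack.org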

Reproducing the problem was difficult. At first I was unable to reproduce it locally, or even in an isolated turbo-hipster run. Since the problem appeared to be specific to our 3rd party tests (there was little evidence of it in 1st party runs) I started by adding extra debugging output to git.

We were originally cloning repositories via the git:// protocol. The debugging information there was unfortunately limited and provided no useful diagnosis. Switching to https allowed for more curl output (when using GIT_CURL_VERBOSE=1 and GIT_TRACE=1) but this in itself just created noise. It actually took me a few days to remember that the servers are running arbitrary code anyway (a side effect of testing) and therefore cloning over the potentially insecure http protocol didn't add any further risk.
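For reference, that extra debugging is enabled simply by setting the environment variables on the clone:

    # verbose curl output plus git transport tracing for an https clone
    GIT_CURL_VERBOSE=1 GIT_TRACE=1 git clone -vvv https://git.openstack.org/openstack/nova .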

Over http we got a little more information, but still nothing that was conclusive at this point:

git clone -vvv http://git.openstack.org/openstack/nova .

error: RPC failed; result=18, HTTP code = 200
fatal: The remote end hung up unexpectedly
fatal: protocol error: bad pack header

After a while it became apparent that the problems occurred mostly during high (patch) traffic times, that is, when a lot of tests needed to be queued. This led me to think that either the network turbo-hipster was on was flaky when doing multiple git clones in parallel, or the git servers themselves were flaky. The lack of similar upstream failures initially led me to suspect the former. In order to reproduce the problem I decided to use Ansible to do multiple clones of repositories and see if that would uncover it. If needed I would then have extended this to orchestrating other parts of turbo-hipster, in case the problem was a symptom of something else.

Firstly I needed to clone from a bunch of different servers at once to simulate the network failures more closely (rather than, say, doing multiple clones on one machine or from one IP in containers). To simplify this I decided to learn some Ansible and launch a bunch of nodes on Rackspace (instead of doing it by hand).

Using the pyrax module I put together a crude playbook to launch a bunch of servers. There are likely much neater and better ways of doing this, but it suited my needs. The playbook takes care of placing the appropriate ssh keys so I could continue to use the nodes later.

    ---
    - name: Create VMs
      hosts: localhost
      vars:
        ssh_known_hosts_command: "ssh-keyscan -H -T 10"
        ssh_known_hosts_file: "/root/.ssh/known_hosts"
      tasks:
        - name: Provision a set of instances
          local_action:
            module: rax
            name: "josh-testing-ansible"
            flavor: "4"
            image: "Ubuntu 12.04 LTS (Precise Pangolin) (PVHVM)"
            region: "DFW"
            count: "15"
            group: "raxhosts"
            wait: yes
          register: raxcreate

        - name: Add the instances we created (by public IP) to the group 'raxhosts'
          local_action:
            module: add_host
            hostname: "{{ item.name }}"
            ansible_ssh_host: "{{ item.rax_accessipv4 }}"
            ansible_ssh_pass: "{{ item.rax_adminpass }}"
            groupname: raxhosts
          with_items: raxcreate.success
          when: raxcreate.action == 'create'

        - name: Sleep to give time for the instances to start ssh
          #there is almost certainly a better way of doing this
          pause: seconds=30

        - name: Scan the host key
          shell: "{{ ssh_known_hosts_command}} {{ item.rax_accessipv4 }} >> {{ ssh_known_hosts_file }}"
          with_items: raxcreate.success
          when: raxcreate.action == 'create'

    - name: Set up sshkeys
      hosts: raxhosts
      tasks:
        - name: Push root's pubkey
          authorized_key: user=root key="{{ lookup('file', '/root/.ssh/id_rsa.pub') }}"

From here I can use Ansible to work on those servers using the rax inventory. This allows me to address any nodes within my tenant and then log into them with the seeded ssh key.
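For example, with the rax.py dynamic inventory script (the standard Rackspace inventory that goes with the rax module) the launched nodes can be addressed as the raxhosts group:

    # sanity check that all the launched nodes are reachable
    ansible raxhosts -i rax.py -m ping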

The next step, of course, was to run tests. Firstly I just wanted to reproduce the issue, so the following playbook crudely sets up an environment in which nova can simply be cloned over and over.

    ---
    - name: Prepare servers for git testing
      hosts: josh-testing-ansible*
      serial: "100%"
      tasks:
        - name: Install git
          apt: name=git state=present update_cache=yes
        - name: remove nova if it is already cloned
          shell: 'rm -rf nova'

    - name: Clone nova
      hosts: josh-testing-ansible*
      serial: "100%"
      tasks:
        - name: Clone nova
          shell: "git clone http://git.openstack.org/openstack/nova"

By default Ansible runs with 5 forked processes, meaning it works on 5 servers at a time. We want to exercise git heavily (in the same way turbo-hipster does), so we use the --forks parameter to run the clone on all the servers at once. The plan was to keep launching servers until the error reared its head from the load.
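For example (the playbook filename here is just illustrative):

    # run the clone playbook across every node at once
    ansible-playbook -i rax.py --forks 100 clone_nova.yml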

To my surprise this happened with very few nodes (fewer than 15, but I kept that as my minimum for testing). To confirm, I also ran the tests after launching further nodes and saw it fail at 50 and 100 concurrent clones. It turned out that the more clones I ran, the higher the failure rate.

Now that the problem was reproducible, it was time to do some debugging. I modified the playbook to capture tcpdump information during the clone. Initially git was cloning over IPv6, so I turned that off on the nodes to force IPv4 (just in case it was a v6 issue, though the problem presented itself on both networks). I also locked git.openstack.org to one IP rather than randomly hitting both frontends.

    ---
    - name: Prepare servers for git testing
      hosts: josh-testing-ansible*
      serial: "100%"
      tasks:
        - name: Install git
          apt: name=git state=present update_cache=yes
        - name: remove nova if it is already cloned
          shell: 'rm -rf nova'

    - name: Clone nova and monitor tcpdump
      hosts: josh-testing-ansible*
      serial: "100%"
      vars:
        cap_file: tcpdump_{{ ansible_hostname }}_{{ ansible_date_time['epoch'] }}.cap
      tasks:
        - name: Disable ipv6 1/3
          sysctl: name="net.ipv6.conf.all.disable_ipv6" value=1 sysctl_set=yes
        - name: Disable ipv6 2/3
          sysctl: name="net.ipv6.conf.default.disable_ipv6" value=1 sysctl_set=yes
        - name: Disable ipv6 3/3
          sysctl: name="net.ipv6.conf.lo.disable_ipv6" value=1 sysctl_set=yes
        - name: Restart networking
          service: name=networking state=restarted
        - name: Lock git.o.o to one host
          lineinfile: dest=/etc/hosts line='23.253.252.15 git.openstack.org' state=present
        - name: start tcpdump
          command: "/usr/sbin/tcpdump -i eth0 -nnvvS -w /tmp/{{ cap_file }}"
          async: 6000000
          poll: 0 
        - name: Clone nova
          shell: "git clone http://git.openstack.org/openstack/nova"
          #shell: "git clone http://github.com/openstack/nova"
          ignore_errors: yes
        - name: kill tcpdump
          command: "/usr/bin/pkill tcpdump"
        - name: compress capture file
          command: "gzip {{ cap_file }} chdir=/tmp"
        - name: grab captured file
          fetch: src=/tmp/{{ cap_file }}.gz dest=/var/www/ flat=yes

This gave us a bunch of compressed capture files, which I was then able to take to my colleagues for help debugging (particular thanks to Angus Lees). The results from an early run can be seen here: http://119.9.51.216/old/run1/

Gus determined that the problem was due to an RST packet coming from the source at roughly 60 seconds. This indicated that we were likely hitting a timeout at the server or a firewall during the git-upload-pack phase of the clone.
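For anyone wanting to poke at the captures themselves, the resets are easy to spot by reading the pcap back with a filter for RST packets (capture filename as produced by the playbook above):

    # show only TCP packets with the RST flag set
    tcpdump -nn -r tcpdump_<host>_<epoch>.cap 'tcp[tcpflags] & tcp-rst != 0'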

The solution turned out to be rather straightforward. The git-upload-pack had simply grown too large and would time out depending on the load on the servers. There was a timeout in apache as well as in the HAProxy config for both frontend and backend responsiveness. The relevant patches can be found at https://review.openstack.org/#/c/192490/ and https://review.openstack.org/#/c/192649/
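In broad strokes the fixes amount to raising the relevant timeouts. The exact directives and values are in the reviews above; the snippet below is only a sketch of the kind of settings involved (values illustrative only):

    # haproxy.cfg
    defaults
        timeout client  5m
        timeout server  5m

    # apache
    Timeout 300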

While upping the timeouts avoids the problem, certain projects are clearly pushing the infrastructure to its limits. As such, a few changes were made by the infrastructure team (in particular James Blair) to improve git.openstack.org's responsiveness.

Firstly, git.openstack.org is now a higher-performance (30GB) instance. This is a large step up from the 8GB instances that were previously used as the frontends. Moving to a single frontend also meant the HAProxy algorithm could be changed to leastconn, to help balance connections better (https://review.openstack.org/#/c/193838/).

                          +--------------------+
                          | git.openstack.org  |
                          | (HAProxy frontend) |
                          +----------+---------+
                                     |
                                     |
            +------------------------v------------------------+
            |  +---------------------+  (leastconn algorithm) |
            |  | git01.openstack.org |                        |
            |  |   +---------------------+                    |
            |  +---| git02.openstack.org |                    |
            |      |   +---------------------+                |
            |      +---| git03.openstack.org |                |
            |          |   +---------------------+            |
            |          +---| git04.openstack.org |            |
            |              |   +---------------------+        |
            |              +---| git05.openstack.org |        |
            |                  |  (HAProxy backend)  |        |
            |                  +---------------------+        |
            +-------------------------------------------------+
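The balancing change itself is essentially a one-liner in the backend definition (backend name and ports here are illustrative; see the review above for the real configuration):

    backend git-http
        balance leastconn
        server git01 git01.openstack.org:80 check
        server git02 git02.openstack.org:80 check
        # ... git03 through git05 likewise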

All that was left was to see if things had improved. I reran the test across 15, 30 and then 45 servers. These were all able to clone nova reliably where they had previously been failing. I then upped it to 100 servers, where the cloning began to fail again.

Post-fix logs for those interested:
http://119.9.51.216/run15/
http://119.9.51.216/run30/
http://119.9.51.216/run45/
http://119.9.51.216/run100/
http://119.9.51.216/run15per100/

At this point, however, I'm basically performing a Distributed Denial of Service attack against git. As such, while the servers aren't immune to a DDoS, the problem appears to be fixed.