Fix race in attachable network attachment #36191

cpuguy83 · 2018-02-02T18:49:57Z

Attachable networks are networks created on the cluster which can then
be attached to by non-swarm containers. These networks are lazily
created on the node that wants to attach to that network.

When no container is currently attached to one of these networks on a
node, and then multiple containers which want that network are started
concurrently, this can cause a race condition in the network attachment
where essentially we try to attach the same network to the node twice.

To easily reproduce this issue you must use a multi-node cluster with a
worker node that has lots of CPUs (I used a 36 CPU node).

Repro steps:

1. On manager, `docker network create -d overlay --attachable test`
2. On worker, `docker create --restart=always --network test busybox top`, many times... 200 is a good number (but not much more due to subnet size restrictions)
3. Restart the daemon

When the daemon restarts, it will attempt to start all those containers
simultaneously. Note that you could try to do this yourself over the API,
but it's harder to trigger due to the added latency from going over
the API.

The error produced happens when the daemon tries to start the container
upon allocating the network resources:

attaching to network failed, make sure your network options are correct and check manager logs: context deadline exceeded

What happens here is the worker makes a network attachment request to
the manager. This is an async call which in the happy case would cause a
task to be placed on the node, which the worker is waiting for to get
the network configuration.
In the case of this race, the error occurs on the manager like this:

task allocation failure" error="failed during network allocation for task n7bwwwbymj2o2h9asqkza8gom: failed to allocate network IP for task n7bwwwbymj2o2h9asqkza8gom network rj4szie2zfauqnpgh4eri1yue: could not find an available IP" module=node node.id=u3489c490fx1df8onlyfo1v6e

The task is not created and the worker times out waiting for the task.
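
To illustrate why the worker only ever sees a timeout, here is a minimal Go sketch. It is not the daemon's code: `waitForTask` and `attachWithoutManagerReply` are hypothetical names, the channel stands in for task assignment from the manager, and the 50 ms timeout is arbitrary. The point is only that when the manager fails allocation and never assigns a task, the waiting side has nothing to observe except its own deadline.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForTask is a hypothetical stand-in for the worker side of an
// attachment request: it blocks until a task arrives or ctx expires.
func waitForTask(ctx context.Context, tasks <-chan string) (string, error) {
	select {
	case t := <-tasks:
		return t, nil
	case <-ctx.Done():
		return "", fmt.Errorf("attaching to network failed: %w", ctx.Err())
	}
}

// attachWithoutManagerReply simulates the failure mode: the manager hit an
// allocation error, so no task is ever sent on the channel.
func attachWithoutManagerReply() error {
	tasks := make(chan string) // nothing will ever be written here

	// Arbitrary short timeout so the example finishes quickly.
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	_, err := waitForTask(ctx, tasks)
	return err
}

func main() {
	fmt.Println(attachWithoutManagerReply())
	// prints "attaching to network failed: context deadline exceeded"
}
```

The real allocation error stays in the manager logs, which is why the worker-side message points there.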

The mitigation is to make sure that only one attachment request is in
flight for a given network at a time when the network doesn't already
exist on the node. If the network already exists on the node, no
synchronization is needed: the network is already allocated there, so
nothing has to be requested from the manager.

This basically comes down to a race with Find(network) || Create(network) without any sort of synchronization.
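
The find-or-create pattern with a per-network lock can be sketched as follows. This is a simplified model under stated assumptions, not the code in this PR: `keyedLock`, `node`, `attach`, and `startAll` are hypothetical names, the `requests` counter stands in for the round trip to the manager, and the sketch models only the lazy creation of the network on the node (not per-container address allocation). The lookup is repeated after taking the lock so that goroutines that were waiting see the network created by the first one.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// keyedLock hands out one mutex per key so that unrelated networks do not
// serialize against each other. Hypothetical helper: entries are never
// removed, which is acceptable for a sketch only.
type keyedLock struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func (k *keyedLock) get(key string) *sync.Mutex {
	k.mu.Lock()
	defer k.mu.Unlock()
	if k.locks == nil {
		k.locks = make(map[string]*sync.Mutex)
	}
	l, ok := k.locks[key]
	if !ok {
		l = &sync.Mutex{}
		k.locks[key] = l
	}
	return l
}

// node models the worker's local view of which networks exist.
type node struct {
	mu       sync.RWMutex
	networks map[string]bool
	attachLk keyedLock
	requests int64 // creation requests sent to the simulated manager
}

func (n *node) find(id string) bool {
	n.mu.RLock()
	defer n.mu.RUnlock()
	return n.networks[id]
}

// attach returns once the network exists locally.
func (n *node) attach(id string) {
	if n.find(id) {
		return // fast path: network already on the node, no lock needed
	}
	l := n.attachLk.get(id)
	l.Lock()
	defer l.Unlock()
	if n.find(id) {
		return // another goroutine created it while we waited for the lock
	}
	atomic.AddInt64(&n.requests, 1) // stands in for the manager round trip
	n.mu.Lock()
	n.networks[id] = true
	n.mu.Unlock()
}

// startAll starts the given number of "containers" concurrently and returns
// how many creation requests were issued for the shared network.
func startAll(containers int) int64 {
	n := &node{networks: make(map[string]bool)}
	var wg sync.WaitGroup
	for i := 0; i < containers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			n.attach("test")
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&n.requests)
}

func main() {
	fmt.Println("creation requests:", startAll(200))
	// prints "creation requests: 1"
}
```

Without the lock and the second `find`, several goroutines could pass the first check together and each issue its own request, which is the unsynchronized `Find || Create` described above.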

Signed-off-by: Brian Goff <cpuguy83@gmail.com>

fcrisciani · 2018-02-02T19:15:56Z

@cpuguy83 A question, just to check that I understand the issue: is it basically triggered by making too many requests for the same network from a worker, so that the manager does not allocate a task on that worker in time?

cpuguy83 · 2018-02-02T19:22:04Z

@fcrisciani No, it's because the manager errors out and never assigns the task. The worker is waiting for a task assignment that never comes and eventually times out.

cpuguy83 · 2018-02-02T19:24:14Z

btw, I considered putting this in the cluster provider's AttachNetwork function, but then we end up losing some control over when we lock.

fcrisciani · 2018-02-02T19:29:45Z

@cpuguy83 I mean, this makes the client a good citizen, but I kind of feel that a fix is needed on the manager as well: if it receives 2 or more requests for the same network, it should not stop allocating it, right?

cpuguy83 · 2018-02-02T19:49:17Z

@fcrisciani It does stop allocating it, it throws the error saying it can't allocate. The issue isn't the concurrency on the manager, it's the concurrency in the worker.

fcrisciani

LGTM

fcrisciani · 2018-02-02T21:19:55Z

Moved the discussion offline with @cpuguy83, and the race condition is now clear: it is on the client side. Two goroutines can race in the network creation, so the new lock will prevent that.

vdemeester

LGTM 🐸

selansen · 2018-02-02T22:04:23Z

LGTM

ghost · 2018-02-02T19:56:38Z

daemon/container_operations.go

+		daemon.attachableNetworkLock.Lock(id)
+		defer daemon.attachableNetworkLock.Unlock(id)
+	}
+


Please bear with me as I am new here .... I have a few questions:

1. Is the purpose of the attachableNetworkLock being potentially nil to only perform the lock/test if the node is part of a swarm cluster? If so then:
   a. Is locking this function (only) really such a high cost to pay to make the test optional? Or is there some danger of deadlock when not in a cluster?
   b. Is it correct that if the node leaves the cluster it will still be acquiring this lock anyway?

2. From what I can see in daemon.FindNetwork(), after it returns, either n != nil or err != nil. This is perfectly reasonable behavior and what one should expect, IMHO. That being the case, this lock test could move into the err != nil section above. This makes the logic a little cleaner.

Side note re: 2 -- it wouldn't be a bad idea to simplify the `if err != nil { ... } if n != nil { ... }` code to be just `if err != nil { ... } else { ... }`.

cpuguy83 · 2018-02-05

Honestly, the n != nil check is an optimization: if it's not nil, there is no need to lock.
I also despise else blocks. 🤷‍♂️
The reason not to lock implicitly is that it adds extra serialization when it's not really needed.

ghost · 2018-02-05

Nod. I get that the lock isn't needed if n != nil. I was just pointing out that that test was already effectively done above with the err != nil check, so the body could be moved up there (with a test for lock existence). That keeps the "network not found" logic in one place.

The if / else was a side comment about the way the existing code is already structured. I get the preference for avoiding "else" clauses in general. As a fresh reader of the code, I was expressing that it might be better in this case because if (x) {...} if (y) {...} leads me to believe that both x and y could be true at the same time for purposes of subsequent logic (and that is not the case here). In any case, it's immaterial for purposes of this fix.

thaJeztah · 2018-02-05

Generally, if an early-return is possible, I'd try to avoid an else. I do agree that these "ifs" are a bit "iffy" 😅, e.g.:

moby/daemon/container_operations.go, lines 352 to 358 in c379d26:

	if err != nil {
		// We should always be able to find the network for a
		// managed container.
		if container.Managed {
			return nil, nil, err
		}
	}

could be simplified to:

	if err != nil && container.Managed {
		// We should always be able to find the network for a managed container.
		return nil, nil, err
	}

The second if would become:

	// If we found a network and if it is not dynamically created
	// we should never attempt to attach to that network here.
	if n != nil && (container.Managed || !n.Info().Dynamic()) {
		return n, nil, nil
	}

Which leads to an interesting question: if the container is managed, we never attach, so should this simply be:

	if container.Managed {
		return n, nil, err
	}

ghost · 2018-02-05

The above simplification is very appealing. Presumably it would be followed by a:

	if n != nil && !n.Info().Dynamic() {
		return n, nil, nil
	}

to handle the second test? Again, this would be much more readable (to me anyway).
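
The proposed simplification can be checked with a small Go sketch that compares the nested form and the flattened form over every input allowed by the contract described earlier in this thread (after the lookup, either `n != nil` or `err != nil`). Everything here is a hypothetical model: `network`, `result`, `original`, `simplified`, and `equivalent` are illustrative names, and the `returned` field only records whether the function would have returned early rather than falling through to the attach logic.

```go
package main

import (
	"errors"
	"fmt"
)

// network is a hypothetical stand-in for the real network type; only the
// dynamic flag matters for this comparison.
type network struct{ dynamic bool }

// result captures what an early return would hand back to the caller.
type result struct {
	n        *network
	err      error
	returned bool // false means "fell through to the attach logic"
}

// original mirrors the nested ifs from the existing code.
func original(managed bool, n *network, err error) result {
	if err != nil {
		if managed {
			return result{nil, err, true}
		}
	}
	if n != nil {
		if managed || !n.dynamic {
			return result{n, nil, true}
		}
	}
	return result{}
}

// simplified applies the two flattened checks proposed in the thread.
func simplified(managed bool, n *network, err error) result {
	if managed {
		return result{n, err, true}
	}
	if n != nil && !n.dynamic {
		return result{n, nil, true}
	}
	return result{}
}

// equivalent compares both forms for every input where exactly one of
// n and err is non-nil.
func equivalent() bool {
	errNotFound := errors.New("network not found")
	static, dynamic := &network{}, &network{dynamic: true}
	cases := []struct {
		n   *network
		err error
	}{
		{static, nil}, {dynamic, nil}, {nil, errNotFound},
	}
	for _, managed := range []bool{false, true} {
		for _, c := range cases {
			if original(managed, c.n, c.err) != simplified(managed, c.n, c.err) {
				return false
			}
		}
	}
	return true
}

func main() {
	fmt.Println("equivalent:", equivalent())
	// prints "equivalent: true"
}
```

The equivalence depends on that contract. For a managed container with both `n` and `err` nil, the nested form falls through while the flattened form returns early, so the refactoring is only safe if the lookup never returns that combination.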

thaJeztah · 2018-02-05

@ctelfer-docker I agree (but also felt it was separate from this PR): would you be interested in opening a PR for that?

ghost · 2018-02-05

Agree and will do.

thaJeztah

LGTM

GordonTheTurtle added the status/0-triage label Feb 2, 2018

cpuguy83 added status/2-code-review and removed status/0-triage labels Feb 2, 2018

fcrisciani approved these changes Feb 2, 2018

vdemeester approved these changes Feb 2, 2018

ghost reviewed Feb 2, 2018

thaJeztah approved these changes Feb 4, 2018

yongtang merged commit 6987557 into moby:master Feb 5, 2018

cpuguy83 deleted the fix_attachable_network_race branch February 5, 2018 17:43

fcrisciani mentioned this pull request Apr 25, 2018

Containers on an attachable overlay network fail to restart after dockerd restart #32607

Open
