The Wayback Machine - https://web.archive.org/web/20250802015159/https://github.com/moby/moby/pull/36191
Skip to content

Fix race in attachable network attachment #36191

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Feb 5, 2018

Conversation

cpuguy83
Copy link
Member

@cpuguy83 cpuguy83 commented Feb 2, 2018

Attachable networks are networks created on the cluster which can then
be attached to by non-swarm containers. These networks are lazily
created on the node that wants to attach to that network.

When no container is currently attached to one of these networks on a
node, and then multiple containers which want that network are started
concurrently, this can cause a race condition in the network attachment
where essentially we try to attach the same network to the node twice.

To easily reproduce this issue you must use a multi-node cluster with a
worker node that has lots of CPUs (I used a 36 CPU node).

Repro steps:

  1. On manager, docker network create -d overlay --attachable test
  2. On worker, docker create --restart=always --network test busybox top, many times... 200 is a good number (but not much more due to
    subnet size restrictions)
  3. Restart the daemon

When the daemon restarts, it will attempt to start all those containers
simultaneously. Note that you could try to do this yourself over the API,
but it's harder to trigger due to the added latency from going over
the API.

The error produced happens when the daemon tries to start the container
upon allocating the network resources:

attaching to network failed, make sure your network options are correct and check manager logs: context deadline exceeded

What happens here is the worker makes a network attachment request to
the manager. This is an async call which in the happy case would cause a
task to be placed on the node, which the worker is waiting for to get
the network configuration.
In the case of this race, the error occurs on the manager like this:

task allocation failure" error="failed during network allocation for task n7bwwwbymj2o2h9asqkza8gom: failed to allocate network IP for task n7bwwwbymj2o2h9asqkza8gom network rj4szie2zfauqnpgh4eri1yue: could not find an available IP" module=node node.id=u3489c490fx1df8onlyfo1v6e

The task is not created and the worker times out waiting for the task.


The mitigation for this is to make sure that only one attachment request
is in flight for a given network at a time when the network doesn't
already exist on the node
. If the network already exists on the node
there is no need for synchronization because the network is already
allocated and on the node so there is no need to request it from the
manager.

This basically comes down to a race with Find(network) || Create(network) without any sort of synchronization.

Attachable networks are networks created on the cluster which can then
be attached to by non-swarm containers. These networks are lazily
created on the node that wants to attach to that network.

When no container is currently attached to one of these networks on a
node, and then multiple containers which want that network are started
concurrently, this can cause a race condition in the network attachment
where essentially we try to attach the same network to the node twice.

To easily reproduce this issue you must use a multi-node cluster with a
worker node that has lots of CPUs (I used a 36 CPU node).

Repro steps:

1. On manager, `docker network create -d overlay --attachable test`
2. On worker, `docker create --restart=always --network test busybox
top`, many times... 200 is a good number (but not much more due to
subnet size restrictions)
3. Restart the daemon

When the daemon restarts, it will attempt to start all those containers
simultaneously. Note that you could try to do this yourself over the API,
but it's harder to trigger due to the added latency from going over
the API.

The error produced happens when the daemon tries to start the container
upon allocating the network resources:

```
attaching to network failed, make sure your network options are correct and check manager logs: context deadline exceeded
```

What happens here is the worker makes a network attachment request to
the manager. This is an async call which in the happy case would cause a
task to be placed on the node, which the worker is waiting for to get
the network configuration.
In the case of this race, the error ocurrs on the manager like this:

```
task allocation failure" error="failed during network allocation for task n7bwwwbymj2o2h9asqkza8gom: failed to allocate network IP for task n7bwwwbymj2o2h9asqkza8gom network rj4szie2zfauqnpgh4eri1yue: could not find an available IP" module=node node.id=u3489c490fx1df8onlyfo1v6e
```

The task is not created and the worker times out waiting for the task.

---

The mitigation for this is to make sure that only one attachment reuest
is in flight for a given network at a time *when the network doesn't
already exist on the node*. If the network already exists on the node
there is no need for synchronization because the network is already
allocated and on the node so there is no need to request it from the
manager.

This basically comes down to a race with `Find(network) ||
Create(network)` without any sort of syncronization.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
@fcrisciani
Copy link
Contributor

@cpuguy83 question just to clarify if I got the issue, is the issue basically triggered by making too many request for the same network from a worker so that the manager does not allocate a task in time on that worker?

@cpuguy83
Copy link
Member Author

cpuguy83 commented Feb 2, 2018

@fcrisciani No, it's because the manager errors out and never assigns the task. The worker is waiting for a task assignment that never comes and eventually times out.

@cpuguy83
Copy link
Member Author

cpuguy83 commented Feb 2, 2018

btw, I considered putting this in the cluster provider's AttachNetwork function, but then we end up loosing some control over when we lock.

@fcrisciani
Copy link
Contributor

@cpuguy83 I mean, this makes the client a good citizen but kind of feel like that there is a need for a fix on the manager also, if you receive 2 or more req for the same network should not stop allocating it, right?

@cpuguy83
Copy link
Member Author

cpuguy83 commented Feb 2, 2018

@fcrisciani It does stop allocating it, it throws the error saying it can't allocate. The issue isn't the concurrency on the manager, it's the concurrency in the worker.

Copy link
Contributor

@fcrisciani fcrisciani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@fcrisciani
Copy link
Contributor

Move the discussion offline with @cpuguy83 and now the condition of the race is clear that is on the client side. 2 routines can race in the create network so the new lock will prevent that.

Copy link
Member

@vdemeester vdemeester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🐸

@selansen
Copy link
Contributor

selansen commented Feb 2, 2018

LGTM

daemon.attachableNetworkLock.Lock(id)
defer daemon.attachableNetworkLock.Unlock(id)
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please bear with me as I am new here .... I have a few questions:

  1. Is the purpose of the attachableNetworkLock being potentially nil to only perform the lock/test if the node is part of a swarm cluster? If so then:
    a. Is locking this function (only) really such a high cost to pay to make the test optional? Or is there some danger of deadlock when not in a cluster?
    b. Is it correct that if the node leaves the cluster it will still be acquiring this lock anyway?
  2. From what I can see in daemon.FindNetwork(), after return either n != nil or err != nil. This is perfectly reasonable behavior and what one should expect, IMHO. That being the case, this lock test could move in to the err != nil section above. This makes the logic a little cleaner.

Side note re: 2 -- it wouldn't be a bad idea to simplify the

    if err != nil { ...
    }
    if n != nil {...
    }

Code to be just:

    if err != nil { ... 
    } else { ....
    }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, the n != nil is an optimization since if it's not nil, then there is no need to lock.
I also despise else blocks. 🤷‍♂️
The reason to not implicitly lock is because this is extra serialization when it's not really needed.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nod I get that the lock isn't needed if n != nil. Was just pointing out that that test was already effectively done above with the err != nil, so the body could be moved up there (w/ test for lock existence). Keeps the "network not found" logic in one place.

The if / else was a side-comment re: the way the existing code is already structured. I get the preference for avoiding "else" clauses in general. As a fresh reader of the code, I was expressing that it might be better in this case because if (x) {...} if (y) {...} leads me to believe that both x and y could be true at the same time for purposes of subsequent logic. (and that is not the case here) In any case it's immaterial for purposes of this fix.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, if an early-return is possible, I'd try to avoid an else. I do agree that these "ifs" are a bit "iffy" 😅, e.g.:

if err != nil {
// We should always be able to find the network for a
// managed container.
if container.Managed {
return nil, nil, err
}
}

could be simplified to:

if err != nil && container.Managed {
	// We should always be able to find the network for a managed container.
	return nil, nil, err
}

The second if would become;

// If we found a network and if it is not dynamically created
// we should never attempt to attach to that network here.
if n != nil && (container.Managed || !n.Info().Dynamic()) {
	return n, nil, nil
}

Which leads to an interesting question: if the container is managed, we never attach, so, should this simply be this?:

if container.Managed {
	return n, nil, err
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above simplification is very appealing. Presumably it would be followed by an:

if n != nil && !n.Info().Dynamic() {
        return n, nil, nil
}

to handle the second test? Again, would be much more readable (to me anyway).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ctelfer-docker I agree (but also felt it being separate from this PR): would you be interested in opening a PR for that?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree and will do.

Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@yongtang yongtang merged commit 6987557 into moby:master Feb 5, 2018
@cpuguy83 cpuguy83 deleted the fix_attachable_network_race branch February 5, 2018 17:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants