Allow awslogs to use non-blocking mode by IRCody · Pull Request #36522 · moby/moby · GitHub

IRCody · 2018-03-07T23:15:04Z

Allow awslogs to use non-blocking mode

When the non-blocking mode is specified, awslogs will:

- No longer potentially block calls to logstream.Log(), instead will
  return an error if the awslogs buffer is full. This has the effect of
  dropping log messages sent to awslogs.Log() that are made while the
  buffer is full.
- Wait to initialize the log stream until the first Log() call instead of in
  New(). This has the effect of allowing the container to start in
  the case where Cloudwatch Logs is unreachable.

Both of these changes require the --log-opt mode=non-blocking to be
explicitly set and do not modify the default behavior.

Signed-off-by: Cody Roseborough <crrosebo@amazon.com>

Fixes #33803

Since this changes some behavior, I want to call out specific scenarios here. All scenarios describe behavior when --log-opt mode=non-blocking is set. Default behavior should remain unchanged.

Scenario 1: Unable to connect to CloudWatch Logs on startup

Current behavior:

Fails to start container after the timeout is reached during the createLogStream call.

After this PR:

Container will start and initialize the log stream on the first call to Log() (and all subsequent calls until the log stream is created successfully). Errors here are not fatal to the container.

Scenario 2: CloudWatch Logs becomes unavailable and the container is stopped.

Current behavior:

The RingLogger Close() will call awslogs.Log() on all remaining messages in the buffer. Each call potentially blocks on a channel write that ultimately depends on a call to CloudWatch Logs. These calls will complete, but very slowly, as multiple calls to CloudWatch Logs time out.

After this PR:

awslogs.Log() is no longer blocking. If it's unable to add an item to its internal buffer it returns an error. This has the effect of ignoring log messages that come after the awslogs internal buffer is full.
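The drop-on-full behavior described in Scenario 2 can be illustrated with a small standalone sketch. This is not the driver's code: `tryEnqueue`, `errBufferFull`, and the buffer size of 2 are hypothetical names and values chosen only for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// errBufferFull is a hypothetical stand-in for the "buffer is full" error.
var errBufferFull = errors.New("buffer is full")

// tryEnqueue performs a non-blocking send. If the buffered channel has no
// free slot, the default case runs immediately and the message is rejected
// instead of the caller waiting for space.
func tryEnqueue(buf chan string, msg string) error {
	select {
	case buf <- msg:
		return nil
	default:
		return errBufferFull
	}
}

func main() {
	buf := make(chan string, 2) // hypothetical buffer size
	for _, m := range []string{"a", "b", "c"} {
		if err := tryEnqueue(buf, m); err != nil {
			fmt.Printf("dropped %q: %v\n", m, err)
		}
	}
	fmt.Println("buffered:", len(buf))
}
```

Nothing drains the channel in this sketch, so every message sent after the buffer fills is rejected, which corresponds to the "ignored" messages mentioned above.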

samuelkarp

LGTM!

thaJeztah · 2018-03-09T15:58:52Z

ping @cpuguy83 @anusha-ragunathan PTAL

cpuguy83 · 2018-03-12T16:36:18Z

daemon/logger/awslogs/cloudwatchlogs.go

This extra locking seems excessive.
If a create is in flight and taking some time, this will just block until it's done.
Once it's open, the call is fast anyway, so it doesn't really need the extra level of locking.

cpuguy83 · 2018-03-12T16:39:36Z

daemon/logger/awslogs/cloudwatchlogs_test.go

We should make sure this goroutine is started before testing below.

cpuguy83 · 2018-03-12T16:40:00Z

daemon/logger/awslogs/cloudwatchlogs_test.go

Seems like errorCh and done can be merged.

cpuguy83 · 2018-03-12T16:41:36Z

daemon/logger/awslogs/cloudwatchlogs_test.go

Does this need to be a time.After instead of just a default? Seems like we are trying to mitigate race conditions, but sleeping like this is always racey.

cpuguy83 · 2018-03-12T16:43:36Z

daemon/logger/awslogs/cloudwatchlogs_test.go

Need to make sure goroutine is started.
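One common way to address the "make sure the goroutine is started" and "sleeping is racy" comments is to have the goroutine signal on a channel and to wait on a completion channel with a generous timeout. The sketch below only illustrates that pattern; `startWorker` is a hypothetical helper, not something from the PR's test file.

```go
package main

import (
	"fmt"
	"time"
)

// startWorker launches work in a goroutine and returns only after the
// goroutine has signalled that it is running. The returned channel is
// closed when work has finished.
func startWorker(work func()) <-chan struct{} {
	started := make(chan struct{})
	finished := make(chan struct{})
	go func() {
		close(started) // the goroutine is now running
		work()
		close(finished)
	}()
	<-started
	return finished
}

func main() {
	// The function passed here stands in for the call under test.
	done := startWorker(func() {})
	select {
	case <-done:
		fmt.Println("worker finished")
	case <-time.After(30 * time.Second): // upper bound, not a fixed sleep
		fmt.Println("timed out waiting for worker")
	}
}
```

The timeout only bounds how long a failing test can hang; a passing run returns as soon as the channel is closed.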

codecov · 2018-03-19T22:35:36Z

Codecov Report

❗ No coverage uploaded for pull request base (master@765a9f3).
The diff coverage is 29.72%.

@@            Coverage Diff            @@
##             master   #36522   +/-   ##
=========================================
  Coverage          ?   35.07%           
=========================================
  Files             ?      614           
  Lines             ?    45730           
  Branches          ?        0           
=========================================
  Hits              ?    16038           
  Misses            ?    27585           
  Partials          ?     2107

IRCody · 2018-03-20T17:15:26Z

@cpuguy83: Sorry for my delay in addressing your comments. Can you take another look?

thaJeztah · 2018-03-21T16:58:18Z

Also; ping @anusha-ragunathan PTAL as well 🤗

anusha-ragunathan · 2018-03-29T20:03:11Z

daemon/logger/awslogs/cloudwatchlogs.go

Shouldn't this logstream create be only for the non-blocking case?
The blocking case is already covered in New.

That makes sense. I changed this to check for nonblocking before this call.

anusha-ragunathan · 2018-03-29T20:10:01Z

daemon/logger/awslogs/cloudwatchlogs.go

For my clarification: Why do we need an open bool, when the closed bool can be used to indicate whether the logstream is open or not?

This was poor naming on my part. 'open' indicates that the log group and logstream have been created in cloudwatch while closed indicates that close() has been called. I renamed 'open' to 'created' to (hopefully) better reflect the intent.

anusha-ragunathan · 2018-03-29T20:15:25Z

Test failures are from known issues not related to this PR

anusha-ragunathan · 2018-03-29T20:34:31Z

daemon/logger/awslogs/cloudwatchlogs_test.go

Useful to print err

I added the err to this output.

anusha-ragunathan · 2018-03-29T20:42:53Z

daemon/logger/awslogs/cwlogsiface_mock_test.go

createLogGroupCallCount and putLogEventsCallCount are currently unused. If they won't be used in this PR, I would prefer to remove them.

Removed these.

anusha-ragunathan · 2018-04-04T19:39:28Z

daemon/logger/awslogs/cloudwatchlogs.go

nit: combine as if err := l.create(); err != nil {

Good catch, I changed this.

anusha-ragunathan · 2018-04-04T20:06:28Z

LGTM

anusha-ragunathan · 2018-04-05T19:37:18Z

Windows failure seems new, although looks unrelated to this PR.

Opened #36801 to track.

IRCody · 2018-04-05T23:50:21Z

@dnephin, @cpuguy83: Please take another look when you have a chance. Thanks!

cpuguy83

I'm kind of 50/50 on the approach.
This is relying on the log copier to log errors rather than take some other kind of action in order to be "non-blocking".... it's probably not a horrible thing to rely on but something could change and it would be difficult to know.

cpuguy83 · 2018-04-10T14:14:49Z

daemon/logger/awslogs/cloudwatchlogs.go

Shouldn't we still attempt a create here and just log if the create failed in the case of non blocking?

You could create it here. The side effect would be that a non-blocking container would spend a significant amount of time blocking on startup in the case where this api call fails. My thinking was that a user who explicitly specified non-blocking wouldn't want to delay startup significantly on timeouts to a log api.

At least this could do the create in a goroutine so it's not blocking the first log message?

Would this mean that the Log() function would have to return some sort of "not ready" error when called before the goroutine initialization is complete? In the RingLogger, messages that return errors are not re-added to the queue, so this causes a race between the first call(s) to Log() and the initialization goroutine that might result in losing logs for a container where:

- The logger initialized successfully.
- The RingBuffer was of adequate size to hold all the messages in the time period between when the container started logging messages and the logger was initialized.

I mean, it could block (or just queue the message) like it does now in the case that the connection isn't available yet. Anyway, this is an optimization and not necessarily needed for this change.

Actually, since the connection doesn't need to be available anyway to just collect the log, it would probably be best to handle the connection completely asynchronously for the non-blocking mode.

So maybe worth it to make that change here.
The connection is needed on batch send, not on Log.
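A rough sketch of that idea, with stream creation running in the background while Log() only enqueues, might look like the following. All names here (`stream`, `newStream`, `sent`) are hypothetical, and the slice append stands in for a real batch send; a real driver would also retry a failed create rather than give up.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type stream struct {
	messages chan string
	sent     []string // stand-in for batches delivered to the backend
	wg       sync.WaitGroup
}

// newStream returns immediately. create runs in the background, and queued
// messages are only consumed once it has succeeded.
func newStream(create func() error, size int) *stream {
	s := &stream{messages: make(chan string, size)}
	s.wg.Add(1)
	go func() {
		defer s.wg.Done()
		if err := create(); err != nil {
			for range s.messages { // this sketch simply discards on failure
			}
			return
		}
		for m := range s.messages {
			s.sent = append(s.sent, m)
		}
	}()
	return s
}

// Log never waits for create; it only needs room in the buffer.
func (s *stream) Log(msg string) error {
	select {
	case s.messages <- msg:
		return nil
	default:
		return errors.New("buffer is full")
	}
}

// Close stops accepting messages and waits for the background goroutine.
func (s *stream) Close() []string {
	close(s.messages)
	s.wg.Wait()
	return s.sent
}

func main() {
	release := make(chan struct{})
	s := newStream(func() error { <-release; return nil }, 4)
	fmt.Println("Log returned:", s.Log("hello")) // create is still pending
	close(release)
	fmt.Println("sent:", s.Close())
}
```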

IRCody · 2018-04-11T22:37:38Z

This is relying on the log copier to log errors rather than take some other kind of action in order to be "non-blocking".... it's probably not a horrible thing to rely on but something could change and it would be difficult to know.

Can you expand on this? I am not sure I understand.

cpuguy83 · 2018-04-18T13:25:24Z

Can you expand on this? I am not sure I understand.

It's relying on the caller of Log(msg) to not retry on error or anything.
But also thinking about it more, with the error it's then going to block on the daemon being able to log the error (because currently the caller logs the error). It may be best to just log a debug message in the driver and return a nil in the non-blocking case (when the buffer is full).... maybe even add some metrics in there so admins can track this.
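For illustration only, the "return nil and track a metric" suggestion could be sketched as below. This is a hypothetical shape, not what the PR implements (the PR returns an error when the buffer is full); `driver` and `Dropped` are made-up names, and a real driver would presumably also emit a debug log line where the counter is incremented.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type driver struct {
	dropped  int64 // accessed atomically; kept first for 64-bit alignment
	messages chan string
}

// Log never blocks and never returns an error. A message that does not fit
// in the buffer is counted as dropped instead.
func (d *driver) Log(msg string) error {
	select {
	case d.messages <- msg:
	default:
		atomic.AddInt64(&d.dropped, 1)
	}
	return nil
}

// Dropped reports how many messages have been discarded so far.
func (d *driver) Dropped() int64 {
	return atomic.LoadInt64(&d.dropped)
}

func main() {
	d := &driver{messages: make(chan string, 1)} // hypothetical buffer size
	for _, m := range []string{"a", "b", "c"} {
		_ = d.Log(m)
	}
	fmt.Println("dropped:", d.Dropped())
}
```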

IRCody · 2018-04-20T19:40:17Z

It's relying on the caller of Log(msg) to not retry on error or anything.

Successive calls to Log() might all return errors, but I don't see where it's relying on the daemon to not attempt to log the same message multiple times. From Log()'s POV, whether it is called with a new message or if it is a "retry" doesn't seem like it should matter. For easy reference, the Log() function in this PR is:

// Log submits messages for logging by an instance of the awslogs logging driver
func (l *logStream) Log(msg *logger.Message) error {
	// In the blocking case we have already called create in New()
	if l.logNonBlocking {
		if err := l.create(); err != nil {
			return err
		}
	}
	l.lock.RLock()
	defer l.lock.RUnlock()
	if l.closed {
		return errors.New("awslogs is closed")
	}
	if l.logNonBlocking {
		select {
		case l.messages <- msg:
			return nil
		default:
			return errors.New("awslogs buffer is full")
		}
	}
	l.messages <- msg
	return nil
}

Can you help me understand how it is relying on the caller to not retry?

But also thinking about it more, with the error it's then going to block on the daemon being able to log the error (because currently the caller logs the error). It may be best to just log a debug message in the driver and return a nil in the non-blocking case (when the buffer is full).... maybe even add some metrics in there so admins can track this.

It's unclear to me why returning an error from Log() is going to block the daemon. Why is this error different from other types of errors that might happen? Does the daemon expect no errors in the non-blocking case?

cpuguy83 · 2018-04-20T20:10:43Z

The container is blocked on I/O until the next message is read from the stdio streams of the container. So I'm referring to the container as blocking here, not the daemon.


IRCody · 2018-04-20T21:26:29Z

The container is blocked on I/O until the next message is read from the
stdio streams of the container. So I'm referring to the container as
blocking here, not the daemon.

Can replace daemon with container in my question and it still applies:

It's unclear to me why returning an error from Log() is going to block the ~~daemon~~ container? I think I might have a misunderstanding of what's going on. Maybe it will be helpful to state my assumptions to see what I am missing.

My understanding was that the RingLogger sits between the container's I/O and the log driver, allowing the container's I/O to be read continuously, independent of calls to the log driver, which runs in a separate goroutine. As a result, a blocking call to Log() does not block messages from being enqueued into the ring buffer; the effect is that messages might be overwritten in the ring buffer if Log() is taking too long. That belief seems incompatible with what you're saying. Can you help me understand what I'm missing?

cpuguy83 · 2018-04-23T17:02:08Z

You're right, the ring logger will deal with it.

IRCody · 2018-04-23T21:28:37Z

Thanks for clarifying @cpuguy83. Where does this leave the PR? Are there any additional changes you (or anyone) thinks would make this a better change?

IRCody · 2018-04-25T23:49:13Z

@cpuguy83: I made an attempt at making the log stream/log group initialization async for non-blocking mode, as discussed in some earlier comments. Can you please take a look? Thanks!

cpuguy83 · 2018-04-26T14:00:06Z

daemon/logger/awslogs/cloudwatchlogs.go:173:10:warning: if block ends with a return statement, so drop this else and outdent its block (golint)

cpuguy83 · 2018-04-26T14:01:18Z

daemon/logger/awslogs/cloudwatchlogs.go

I added a 1 second sleep. Do you think this is long enough?

We should probably backoff up to some max(30s?)
Also needs to log the errors.

Modified it to log the error and to double backoff times until it is > 30s.

With doubling each time, 32s would be optimal :)

I changed it to 32.
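The doubling schedule discussed here can be sketched as follows. This is an illustrative loop rather than the code from the PR: `retry`, `errTransient`, and the injected `sleep` callback are hypothetical, and a real loop would call `time.Sleep(time.Duration(backoff) * time.Second)` where the callback is invoked.

```go
package main

import (
	"errors"
	"fmt"
)

const maxBackoff = 32 // seconds; a power of two, so doubling lands on it exactly

// errTransient is a hypothetical error used to simulate a failing create.
var errTransient = errors.New("transient failure")

// retry calls create until it succeeds. After each failure it waits for the
// current backoff, then doubles the backoff until it reaches maxBackoff.
func retry(create func() error, sleep func(seconds int)) {
	backoff := 1
	for {
		if err := create(); err == nil {
			return
		}
		sleep(backoff)
		if backoff < maxBackoff {
			backoff *= 2
		}
	}
}

func main() {
	failures := 7
	create := func() error {
		if failures > 0 {
			failures--
			return errTransient
		}
		return nil
	}
	// Print the delays instead of actually sleeping.
	retry(create, func(s int) { fmt.Printf("retrying in %d seconds\n", s) })
}
```

With a cap of 30, doubling from 1 would stop at 32 anyway (the first value not below the cap), which is why 32 is the tidier constant.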

cpuguy83 · 2018-04-26T14:08:20Z

daemon/logger/awslogs/cloudwatchlogs_test.go

This can be more time just to make sure it's not some crazy timing issue.
Probably 30s is plenty.

cpuguy83 · 2018-04-26T14:08:54Z

daemon/logger/awslogs/cloudwatchlogs_test.go

t.Fatal(err) would be good here.

cpuguy83 · 2018-04-26T14:09:29Z

daemon/logger/awslogs/cloudwatchlogs_test.go

Longer timeout (suggest 30s)

cpuguy83 · 2018-04-26T14:10:27Z

daemon/logger/awslogs/cloudwatchlogs_test.go

t.Fatal(err) is sufficient.

cpuguy83 · 2018-04-26T14:10:47Z

daemon/logger/awslogs/cloudwatchlogs_test.go

"timed out waiting for read"

cpuguy83 · 2018-04-26T14:11:41Z

daemon/logger/awslogs/cloudwatchlogs_test.go

Maybe grab the error from the channel here and put it in the fail message.

IRCody · 2018-04-26T17:39:35Z

@cpuguy83: Thanks for taking a look. I made the changes you suggested.

cpuguy83 · 2018-04-26T20:53:54Z

daemon/logger/awslogs/cloudwatchlogs.go

Nit, logrus.WithError(err).Error(...)

cpuguy83 · 2018-04-26T20:55:26Z

daemon/logger/awslogs/cloudwatchlogs.go

Oh, just realized we need a way to cancel this loop

That's a good point. I added a check to break out of the loop if the Close() method was called on the logger.

cpuguy83 · 2018-04-26T20:58:52Z

Sorry for the back and forth here, getting really close. Thanks for updating.

cpuguy83 · 2018-04-26T21:12:25Z

daemon/logger/awslogs/cloudwatchlogs.go

If we do this, then we need some synchronization.

You mean locking before we read this?

That or use a channel/some other form of synchronization.

I added RLock around it.
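A minimal sketch of a retry loop that stops once Close() has been called, reading the flag under a read lock, is shown below. The type and method names are hypothetical and the backoff wait is reduced to a callback, so this shows only the shape of the synchronization, not the PR's actual code.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errUnavailable is a hypothetical error used to simulate a failing create.
var errUnavailable = errors.New("service unavailable")

type logStream struct {
	lock   sync.RWMutex
	closed bool
}

// Close marks the stream as closed under the write lock.
func (l *logStream) Close() {
	l.lock.Lock()
	l.closed = true
	l.lock.Unlock()
}

// isClosed reads the flag under the read lock, so it is safe to call from
// the retry goroutine while another goroutine calls Close.
func (l *logStream) isClosed() bool {
	l.lock.RLock()
	defer l.lock.RUnlock()
	return l.closed
}

// createWithRetry retries create until it succeeds or the stream is closed.
// It reports whether create eventually succeeded.
func (l *logStream) createWithRetry(create func() error, wait func()) bool {
	for {
		if err := create(); err == nil {
			return true
		}
		if l.isClosed() {
			return false // give up: nobody will use this stream anymore
		}
		wait()
	}
}

func main() {
	l := &logStream{}
	attempts := 0
	ok := l.createWithRetry(
		func() error { attempts++; return errUnavailable },
		func() {
			if attempts == 3 {
				l.Close() // simulate the container being stopped
			}
		},
	)
	fmt.Println("created:", ok, "attempts:", attempts)
}
```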

cpuguy83

One more nit, but otherwise LGTM

cpuguy83 · 2018-04-27T12:52:51Z

daemon/logger/awslogs/cloudwatchlogs.go

Can you add the container ID to log (.WithField(info.ContainerID)) and some extra details that it is going to try again in backoff seconds? Maybe change the wording Error while trying to initialize awslogs so the log implies that it's not giving up.

Added the container ID and container name, and changed the message to "Error while trying to initialize awslogs. Retrying in [backoff] seconds".

thaJeztah · 2018-04-30T09:51:26Z

ping @anusha-ragunathan still LGTY? looks like this is ready to go

anusha-ragunathan · 2018-04-30T20:21:24Z

daemon/logger/awslogs/cloudwatchlogs.go

+					break
+				}
+
+				time.Sleep(time.Duration(backoff) * time.Second)


For my clarification: Does the backoff make sense for all types of errors from create? Or is there a way to filter by errors that are known to heal with time? Another way to look at it is, if there are errors that are known to not heal even with backoff, then we should exit right away.

This is an optimization and can be addressed in a follow-up PR, if applicable.

This is a good question.

The classes of errors I can think of all seem to have the ability to heal over time. The classes I can think of are:

Network availability

Service errors

Service Limits

Credentials/policy issue

Do you see something outside of these classes or a way that one of these won't be able to possibly heal?

IMO, credentials and policy issues don't heal over time. They need manual intervention to fix the issue. Also, service errors is a broad category. There's a possibility that there are some types of service errors that don't auto-heal with time.

Credential and permission issues can be healed out-of-band by changes in the IAM policy. They won't automatically heal, but can be fixed while the container is running.

By "service errors", @IRCody is talking about the kinds of errors that would be fixed by AWS, also out-of-band with respect to the container. Network availability and service errors are the general issue that we're trying to address with this code.

Sounds good.

Agree, all things that can be fixed outside of the scope of the container (e.g. when error logs start showing up).
On a separate note, it might be nice (in another exercise) to track metrics for errors here.

anusha-ragunathan · 2018-05-01T17:41:38Z

LGTM

cpuguy83 · 2018-05-01T20:31:07Z

Thanks @IRCody

GordonTheTurtle added the status/0-triage label Mar 7, 2018

IRCody force-pushed the awslogs-non-blocking branch from 4194d6e to 58eb211 Compare March 7, 2018 23:35

samuelkarp approved these changes Mar 8, 2018

IRCody mentioned this pull request Mar 8, 2018

ECS agent fails to degrade gracefully if Cloudwatch Logs down aws/amazon-ecs-agent#1088

Closed

yongtang added status/2-code-review and removed status/0-triage labels Mar 8, 2018

thaJeztah added the area/logging label Mar 9, 2018

thaJeztah added the rebuild/* label Mar 9, 2018

GordonTheTurtle removed the rebuild/* label Mar 9, 2018

thaJeztah added impact/changelog docs/revisit labels Mar 9, 2018

cpuguy83 requested changes Mar 12, 2018

IRCody force-pushed the awslogs-non-blocking branch from 58eb211 to c919298 Compare March 19, 2018 22:35

GordonTheTurtle assigned dnephin Mar 22, 2018

anusha-ragunathan reviewed Mar 29, 2018

IRCody force-pushed the awslogs-non-blocking branch from c919298 to 799e381 Compare April 4, 2018 18:24

anusha-ragunathan reviewed Apr 4, 2018

IRCody force-pushed the awslogs-non-blocking branch from 799e381 to 7b94321 Compare April 4, 2018 20:56

cpuguy83 requested changes Apr 10, 2018

IRCody force-pushed the awslogs-non-blocking branch from 7b94321 to f156071 Compare April 25, 2018 23:46

cpuguy83 requested changes Apr 26, 2018

IRCody force-pushed the awslogs-non-blocking branch from f156071 to 974b0e5 Compare April 26, 2018 17:37

IRCody force-pushed the awslogs-non-blocking branch 2 times, most recently from ede23a5 to 9440f08 Compare April 26, 2018 20:33

cpuguy83 reviewed Apr 26, 2018

cpuguy83 reviewed Apr 26, 2018

IRCody force-pushed the awslogs-non-blocking branch from 9440f08 to 830da1d Compare April 26, 2018 21:01

cpuguy83 reviewed Apr 26, 2018

IRCody force-pushed the awslogs-non-blocking branch from 830da1d to f262b70 Compare April 26, 2018 21:55

cpuguy83 approved these changes Apr 27, 2018

IRCody force-pushed the awslogs-non-blocking branch from f262b70 to c7e3799 Compare April 27, 2018 17:59

anusha-ragunathan reviewed Apr 30, 2018

cpuguy83 merged commit fe2d3a1 into moby:master May 1, 2018

danieladams456 mentioned this pull request Nov 25, 2020

Non-blocking default for log drivers? #41714

Closed
