The Wayback Machine - https://web.archive.org/web/20260314102111/https://github.com/moby/moby/pull/36522
Skip to content

Allow awslogs to use non-blocking mode#36522

Merged
cpuguy83 merged 1 commit intomoby:masterfrom
IRCody:awslogs-non-blocking
May 1, 2018
Merged

Allow awslogs to use non-blocking mode#36522
cpuguy83 merged 1 commit intomoby:masterfrom
IRCody:awslogs-non-blocking

Conversation

@IRCody
Copy link
Contributor

@IRCody IRCody commented Mar 7, 2018

Allow awslogs to use non-blocking mode

When then non-blocking mode is specified, awslogs will:

- No longer potentially block calls to logstream.Log(), instead will
  return an error if the awslogs buffer is full. This has the effect of
  dropping log messages sent to awslogs.Log() that are made while the
  buffer is full.
- Wait to initialize the log stream until the first Log() call instead of in
  New(). This has the effect of allowing the container to start in
  the case where Cloudwatch Logs is unreachable.

Both of these changes require the --log-opt mode=non-blocking to be
explicitly set and do not modify the default behavior.

Signed-off-by: Cody Roseborough <crrosebo@amazon.com>

Fixes #33803

Since this has some changes in behavior I wanted to call out some specific scenarios here. All scenarios are talking about behavior when logopt mode=non-blocking is set. Default behavior should remain unchanged.

Scenario 1: Unable to connect to CloudWatch Logs on startup

Current behavior:

Fails to start container after the timeout is reached during the createLogStream call.

After this PR:

Container will start and initialize the log stream on the first call to Log() (and all subsequent calls until the log stream is created successfully). Errors here are not fatal to the container.

Scenario 2: CloudWatch Logs becomes unavailable and the container is stopped.

Current behavior:

The RingLogger Close() will call awslogs.Log() on all remaining messages in the buffer. Each call is potentially blocking on a channel write that ultimately depends on a call to CloudWatch Logs. These will complete but very slowly as it times out multiple calls to CloudWatch Logs.

After this PR:

awslogs.Log() is no longer blocking. If it's unable to add an item to its internal buffer it returns an error. This has the effect of ignoring log messages that come after the awslogs internal buffer is full.

Copy link
Member

@samuelkarp samuelkarp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@thaJeztah
Copy link
Member

ping @cpuguy83 @anusha-ragunathan PTAL

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This extra locking seems excessive.
Since in the case that there is a create in flight that is taking some time this will just block until it's done.
When it's open, the call is fast anyway so doesn't really need the extra level of locking.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should make sure this goroutine is started before testing below.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like errorCh and done can be merged.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be a time.After instead of just a default? Seems like we are trying to mitigate race conditions, but sleeping like this is always racey.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to make sure goroutine is started.

@IRCody IRCody force-pushed the awslogs-non-blocking branch from 58eb211 to c919298 Compare March 19, 2018 22:35
@codecov
Copy link

codecov bot commented Mar 19, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@765a9f3). Click here to learn what that means.
The diff coverage is 29.72%.

@@            Coverage Diff            @@
##             master   #36522   +/-   ##
=========================================
  Coverage          ?   35.07%           
=========================================
  Files             ?      614           
  Lines             ?    45730           
  Branches          ?        0           
=========================================
  Hits              ?    16038           
  Misses            ?    27585           
  Partials          ?     2107

@IRCody
Copy link
Contributor Author

IRCody commented Mar 20, 2018

@cpuguy83: Sorry for my delay in addressing your comments. Can you take another look?

@thaJeztah
Copy link
Member

Also; ping @anusha-ragunathan PTAL as well 🤗

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldnt this logstream create be only for the non-blocking case?
The blocking case is already covered in New

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. I changed this to check for nonblocking before this call.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my clarification: Why do we need an open bool, when the closed bool can be used to indicate whether the logstream is open or not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was poor naming on my part. 'open' indicates that the log group and logstream have been created in cloudwatch while closed indicates that close() has been called. I renamed 'open' to 'created' to (hopefully) better reflect the intent.

@anusha-ragunathan
Copy link
Contributor

Test failures are from known issues not related to this PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Useful to print err

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the err to this output.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

createLogGroupCallCount and putLogEventsCallCount are unused atm. If they wont be used in this PR, I would prefer to remove them

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed these.

@IRCody IRCody force-pushed the awslogs-non-blocking branch from c919298 to 799e381 Compare April 4, 2018 18:24
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: combine as if err := l.create(); err != nil {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I changed this.

@anusha-ragunathan
Copy link
Contributor

LGTM

@IRCody IRCody force-pushed the awslogs-non-blocking branch from 799e381 to 7b94321 Compare April 4, 2018 20:56
@anusha-ragunathan
Copy link
Contributor

Windows failure seems new, although looks unrelated to this PR.

Opened #36801 to track.

@IRCody
Copy link
Contributor Author

IRCody commented Apr 5, 2018

@dnephin, @cpuguy83: Please take another look when you have a chance. Thanks!

Copy link
Member

@cpuguy83 cpuguy83 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm kind of 50/50 on the approach.
This is relying on the log copier to log errors rather than take some other kind of action in order to be "non-blocking".... it's probably not a horrible thing to rely on but something could change and it would be difficult to know.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we still attempt a create here and just log if the create failed in the case of non blocking?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could create it here. The side effect would be that a non-blocking container would spend a significant amount of time blocking on startup in the case where this api call fails. My thinking was that a user who explicitly specified non-blocking wouldn't want to delay startup significantly on timeouts to a log api.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least this could do the create in a goroutine so it's not blocking the first log message?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this mean that the Log() function would have to return some sort of "not ready" error when called before the goroutine initialization is complete. In the RingLogger, messages that return errors are not re-added to the queue so this causes a race between the first call(s) to Log() and the initialization go routine that might result in losing Logs for a container where:

  1. The logger initialized successfully.
  2. The RingBuffer was of adequate size to hold all the messages in the time period between when the container started logging messages and the logger was initialized.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, it could block (or just queue the message) like it does now in the case that the connection isn't available yet. Anyway, this is an optimization and not necessarily needed for this change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, since the connection doesn't need to be available anyway to just collect the log, it would probably be best to handle the connection completely asynchronously for the non-blocking mode.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So maybe worth it to make that change here.
The connection is needed on batch send, not on Log.

@IRCody
Copy link
Contributor Author

IRCody commented Apr 11, 2018

This is relying on the log copier to log errors rather than take some other kind of action in order to be "non-blocking".... it's probably not a horrible thing to rely on but something could change and it would be difficult to know.

Can you expand on this? I am not sure I understand.

@cpuguy83
Copy link
Member

Can you expand on this? I am not sure I understand.

It's relying on the caller of Log(msg) to not retry on error or anything.
But also thinking about it more, with the error it's then going to block on the daemon being able to log the error (because currently the caller logs the error). It may be best to just log a debug message in the driver and return a nil in the non-blocking case (when the buffer is full).... maybe even add some metrics in there so admins can track this.

@IRCody
Copy link
Contributor Author

IRCody commented Apr 20, 2018

It's relying on the caller of Log(msg) to not retry on error or anything.

Successive calls to Log() might all return errors but I don't see where its relying on the daemon to not attempt to log the same message multiple times. From Log()'s POV whether it is called with a new message or if it is a "retry" doesn't seem like it should matter. For easy reference, the Log() function in this PR is:

// Log submits messages for logging by an instance of the awslogs logging driver
func (l *logStream) Log(msg *logger.Message) error {
	// In the blocking case we have already called create in New()
	if l.logNonBlocking {
		if err := l.create(); err != nil {
			return err
		}
	}
	l.lock.RLock()
	defer l.lock.RUnlock()
	if l.closed {
		return errors.New("awslogs is closed")
	}
	if l.logNonBlocking {
		select {
		case l.messages <- msg:
			return nil
		default:
			return errors.New("awslogs buffer is full")
		}
	}
	l.messages <- msg
	return nil
}

Can you help me understand how it is relying on the caller to not retry?

But also thinking about it more, with the error it's then going to block on the daemon being able to log the error (because currently the caller logs the error). It may be best to just log a debug message in the driver and return a nil in the non-blocking case (when the buffer is full).... maybe even add some metrics in there so admins can track this.

It's unclear to me why returning an error from Log() is going to block the daemon. Why is this error different from other types of errors that might happen? Does the daemon expect no errors in the non-blocking case?

@cpuguy83
Copy link
Member

cpuguy83 commented Apr 20, 2018 via email

@IRCody
Copy link
Contributor Author

IRCody commented Apr 20, 2018

The container is blocked on I/O until the next message is read from the
stdio streams of the container. So I'm referring to the container as
blocking here, not the daemon.

Can replace daemon with container in my question and it still applies:

It's unclear to me why returning an error from Log() is going to block the daemon container? I think I might have a misunderstanding of what's going on. Maybe it will be helpful to state my assumptions to see what I am missing.

My understanding was that the RingLogger sat in between the containers I/O and the Log driver, allowing the containers I/O to be read from continuously independent from calls to the logdriver, which runs in a separate goroutine. This results in Blocking the call to Log() not blocking messages being enqueued into the ring buffer with the effect that messages might be overwritten in the ring buffer if Log() is taking to long. That belief seems incompatible with what you're saying. Can you help me understand what I'm missing?

@cpuguy83
Copy link
Member

You're right, the ring logger will deal with it.

@IRCody
Copy link
Contributor Author

IRCody commented Apr 23, 2018

Thanks for clarifying @cpuguy83. Where does this leave the PR? Are there any additional changes you (or anyone) thinks would make this a better change?

@IRCody IRCody force-pushed the awslogs-non-blocking branch from 7b94321 to f156071 Compare April 25, 2018 23:46
@IRCody
Copy link
Contributor Author

IRCody commented Apr 25, 2018

@cpuguy83: I made an attempt at making the log stream/log group initialization async for non-blocking like discussed in some earlier comments. Can you take please take a look? Thanks!

@cpuguy83
Copy link
Member

daemon/logger/awslogs/cloudwatchlogs.go:173:10:warning: if block ends with a return statement, so drop this else and outdent its block (golint)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sleep?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a 1 second sleep. Do you think this is long enough?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably backoff up to some max(30s?)
Also needs to log the errors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified it to log the error and to double backoff times until it is > 30s.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With doubling each time, 32s would be optimal :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it to 32.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be more time just to make sure it's not some crazy timing issue.
Probably 30s is plenty.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t.Fatal(err) would be good here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Longer timeout (suggest 30s)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t.Fatal(err) is sufficient.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"timed out waiting for read"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe grab the error from the channel here and put it in the fail message.

@IRCody IRCody force-pushed the awslogs-non-blocking branch from f156071 to 974b0e5 Compare April 26, 2018 17:37
@IRCody
Copy link
Contributor Author

IRCody commented Apr 26, 2018

@cpuguy83: Thanks for taking a look. I made the changes you suggested.

@IRCody IRCody force-pushed the awslogs-non-blocking branch 2 times, most recently from ede23a5 to 9440f08 Compare April 26, 2018 20:33
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, logrus.WithError(err).Error(...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, just realized we need a way to cancel this loop

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. I added a check to break out of the loop if the Close() method was called on the logger.

@cpuguy83
Copy link
Member

Sorry for the back and forth here, getting really close. Thanks for updating.

@IRCody IRCody force-pushed the awslogs-non-blocking branch from 9440f08 to 830da1d Compare April 26, 2018 21:01
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do this, then we need some synchronization.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean locking before we read this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That or use a channel/some other form of syncronization.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added RLock around it.

@IRCody IRCody force-pushed the awslogs-non-blocking branch from 830da1d to f262b70 Compare April 26, 2018 21:55
Copy link
Member

@cpuguy83 cpuguy83 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more nit, but otherwise LGTM

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add the container ID to log (.WithField(info.ContainerID)) and some extra details that it is going to try again in backoff seconds? Maybe change the wording Error while trying to initialize awslogs so the log implies that it's not giving up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added container Id, container name as well as making the message Error while trying to initialize awslogs. Retrying in [backoff] seconds".

When then non-blocking mode is specified, awslogs will:

- No longer potentially block calls to logstream.Log(), instead will
  return an error if the awslogs buffer is full. This has the effect of
  dropping log messages sent to awslogs.Log() that are made while the
  buffer is full.
- Wait to initialize the log stream until the first Log() call instead of in
  New(). This has the effect of allowing the container to start in
  the case where Cloudwatch Logs is unreachable.

Both of these changes require the --log-opt mode=non-blocking to be
explicitly set and do not modify the default behavior.

Signed-off-by: Cody Roseborough <crrosebo@amazon.com>
@IRCody IRCody force-pushed the awslogs-non-blocking branch from f262b70 to c7e3799 Compare April 27, 2018 17:59
@thaJeztah
Copy link
Member

ping @anusha-ragunathan still LGTY? looks like this is ready to go

break
}

time.Sleep(time.Duration(backoff) * time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my clarification: Does the backoff make sense for all types of errors from create? Or is there a way to filter by errors that are known to heal with time? Another way to look at it is, if there are errors that are known to not heal even with backoff, then we should exit right away.

This is an optimization and can be addressed in a follow-up PR, if applicable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good question.

The classes of errors I can think of all seem to have the ability to heal over time. The classes I can think of are:

  • Network availability
  • Service errors
  • Service Limits
  • Credentials/policy issue

Do you see something outside of these classes or a way that one of these won't be able to possibly heal?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, Credentials and policy issues dont heal over time. They need manual intervention to fix the issue. Also service errors is a broad category. There's a possibility that there are some types of service errors that dont auto-heal with time.

Copy link
Member

@samuelkarp samuelkarp May 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Credential and permission issues can be healed out-of-band by changes in the IAM policy. They won't automatically heal, but can be fixed while the container is running.

By "service errors", @IRCody is talking about the kinds of errors that would be fixed by AWS, also out-of-band with respect to the container. Network availability and service errors are the general issue that we're trying to address with this code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, all things that can be fixed outside of the scope of the container (e.g. when error logs start showing up).
On a separate note it might be nice (in a another exercise) to track metrics for errors here.

@anusha-ragunathan
Copy link
Contributor

LGTM

@cpuguy83 cpuguy83 merged commit fe2d3a1 into moby:master May 1, 2018
@cpuguy83
Copy link
Member

cpuguy83 commented May 1, 2018

Thanks @IRCody

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants