Allow awslogs to use non-blocking mode#36522
Conversation
4194d6e to
58eb211
Compare
|
ping @cpuguy83 @anusha-ragunathan PTAL |
There was a problem hiding this comment.
This extra locking seems excessive.
Since in the case that there is a create in flight that is taking some time this will just block until it's done.
When it's open, the call is fast anyway so doesn't really need the extra level of locking.
There was a problem hiding this comment.
We should make sure this goroutine is started before testing below.
There was a problem hiding this comment.
Seems like errorCh and done can be merged.
There was a problem hiding this comment.
Does this need to be a time.After instead of just a default? Seems like we are trying to mitigate race conditions, but sleeping like this is always racey.
There was a problem hiding this comment.
Need to make sure goroutine is started.
58eb211 to
c919298
Compare
Codecov Report
@@ Coverage Diff @@
## master #36522 +/- ##
=========================================
Coverage ? 35.07%
=========================================
Files ? 614
Lines ? 45730
Branches ? 0
=========================================
Hits ? 16038
Misses ? 27585
Partials ? 2107 |
|
@cpuguy83: Sorry for my delay in addressing your comments. Can you take another look? |
|
Also; ping @anusha-ragunathan PTAL as well 🤗 |
There was a problem hiding this comment.
Shouldnt this logstream create be only for the non-blocking case?
The blocking case is already covered in New
There was a problem hiding this comment.
That makes sense. I changed this to check for nonblocking before this call.
There was a problem hiding this comment.
For my clarification: Why do we need an open bool, when the closed bool can be used to indicate whether the logstream is open or not?
There was a problem hiding this comment.
This was poor naming on my part. 'open' indicates that the log group and logstream have been created in cloudwatch while closed indicates that close() has been called. I renamed 'open' to 'created' to (hopefully) better reflect the intent.
|
Test failures are from known issues not related to this PR |
There was a problem hiding this comment.
Useful to print err
There was a problem hiding this comment.
I added the err to this output.
There was a problem hiding this comment.
createLogGroupCallCount and putLogEventsCallCount are unused atm. If they wont be used in this PR, I would prefer to remove them
c919298 to
799e381
Compare
There was a problem hiding this comment.
nit: combine as if err := l.create(); err != nil {
There was a problem hiding this comment.
Good catch, I changed this.
|
LGTM |
799e381 to
7b94321
Compare
|
Windows failure seems new, although looks unrelated to this PR. Opened #36801 to track. |
cpuguy83
left a comment
There was a problem hiding this comment.
I'm kind of 50/50 on the approach.
This is relying on the log copier to log errors rather than take some other kind of action in order to be "non-blocking".... it's probably not a horrible thing to rely on but something could change and it would be difficult to know.
There was a problem hiding this comment.
Shouldn't we still attempt a create here and just log if the create failed in the case of non blocking?
There was a problem hiding this comment.
You could create it here. The side effect would be that a non-blocking container would spend a significant amount of time blocking on startup in the case where this api call fails. My thinking was that a user who explicitly specified non-blocking wouldn't want to delay startup significantly on timeouts to a log api.
There was a problem hiding this comment.
At least this could do the create in a goroutine so it's not blocking the first log message?
There was a problem hiding this comment.
Would this mean that the Log() function would have to return some sort of "not ready" error when called before the goroutine initialization is complete. In the RingLogger, messages that return errors are not re-added to the queue so this causes a race between the first call(s) to Log() and the initialization go routine that might result in losing Logs for a container where:
- The logger initialized successfully.
- The RingBuffer was of adequate size to hold all the messages in the time period between when the container started logging messages and the logger was initialized.
There was a problem hiding this comment.
I mean, it could block (or just queue the message) like it does now in the case that the connection isn't available yet. Anyway, this is an optimization and not necessarily needed for this change.
There was a problem hiding this comment.
Actually, since the connection doesn't need to be available anyway to just collect the log, it would probably be best to handle the connection completely asynchronously for the non-blocking mode.
There was a problem hiding this comment.
So maybe worth it to make that change here.
The connection is needed on batch send, not on Log.
Can you expand on this? I am not sure I understand. |
It's relying on the caller of |
Successive calls to Log() might all return errors but I don't see where its relying on the daemon to not attempt to log the same message multiple times. From Log()'s POV whether it is called with a new message or if it is a "retry" doesn't seem like it should matter. For easy reference, the Log() function in this PR is: // Log submits messages for logging by an instance of the awslogs logging driver
func (l *logStream) Log(msg *logger.Message) error {
// In the blocking case we have already called create in New()
if l.logNonBlocking {
if err := l.create(); err != nil {
return err
}
}
l.lock.RLock()
defer l.lock.RUnlock()
if l.closed {
return errors.New("awslogs is closed")
}
if l.logNonBlocking {
select {
case l.messages <- msg:
return nil
default:
return errors.New("awslogs buffer is full")
}
}
l.messages <- msg
return nil
}Can you help me understand how it is relying on the caller to not retry?
It's unclear to me why returning an error from Log() is going to block the daemon. Why is this error different from other types of errors that might happen? Does the daemon expect no errors in the non-blocking case? |
|
The container is blocked on I/O until the next message is read from the
stdio streams of the container. So I'm referring to the container as
blocking here, not the daemon.
…On Fri, Apr 20, 2018 at 3:40 PM Cody Roseborough ***@***.***> wrote:
It's relying on the caller of Log(msg) to not retry on error or anything.
Successive calls to Log() might all return errors but I don't see where
its relying on the daemon to not attempt to log the same message multiple
times. From Log()'s POV whether it is called with a new message or if it is
a "retry" doesn't seem like it should matter. For easy reference, the Log()
function in this PR is:
// Log submits messages for logging by an instance of the awslogs logging driverfunc (l *logStream) Log(msg *logger.Message) error
{
// In the blocking case we have already called create in New()
if l.logNonBlocking {
if err := l.create(); err != nil
{
return err
}
}
l.lock.RLock()
defer l.lock.RUnlock()
if l.closed {
return errors.New("awslogs is closed")
}
if l.logNonBlocking {
select {
case l.messages <- msg:
return nil
default:
return errors.New("awslogs buffer is full")
}
}
l.messages <- msg
return nil
}
Can you help me understand how it is relying on the caller to not retry?
But also thinking about it more, with the error it's then going to block
on the daemon being able to log the error (because currently the caller
logs the error). It may be best to just log a debug message in the driver
and return a nil in the non-blocking case (when the buffer is full)....
maybe even add some metrics in there so admins can track this.
It's unclear to me why returning an error from Log() is going to block the
daemon. Why is this error different from other types of errors that might
happen? Does the daemon expect no errors in the non-blocking case?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#36522 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAwxZqx5OmS2LhFFKo1Jnmw8338dTS7Tks5tqjmugaJpZM4ShoUu>
.
|
Can replace daemon with container in my question and it still applies: It's unclear to me why returning an error from Log() is going to block the My understanding was that the RingLogger sat in between the containers I/O and the Log driver, allowing the containers I/O to be read from continuously independent from calls to the logdriver, which runs in a separate goroutine. This results in Blocking the call to Log() not blocking messages being enqueued into the ring buffer with the effect that messages might be overwritten in the ring buffer if Log() is taking to long. That belief seems incompatible with what you're saying. Can you help me understand what I'm missing? |
|
You're right, the ring logger will deal with it. |
|
Thanks for clarifying @cpuguy83. Where does this leave the PR? Are there any additional changes you (or anyone) thinks would make this a better change? |
7b94321 to
f156071
Compare
|
@cpuguy83: I made an attempt at making the log stream/log group initialization async for non-blocking like discussed in some earlier comments. Can you take please take a look? Thanks! |
|
There was a problem hiding this comment.
I added a 1 second sleep. Do you think this is long enough?
There was a problem hiding this comment.
We should probably backoff up to some max(30s?)
Also needs to log the errors.
There was a problem hiding this comment.
Modified it to log the error and to double backoff times until it is > 30s.
There was a problem hiding this comment.
With doubling each time, 32s would be optimal :)
There was a problem hiding this comment.
This can be more time just to make sure it's not some crazy timing issue.
Probably 30s is plenty.
There was a problem hiding this comment.
t.Fatal(err) would be good here.
There was a problem hiding this comment.
Maybe grab the error from the channel here and put it in the fail message.
f156071 to
974b0e5
Compare
|
@cpuguy83: Thanks for taking a look. I made the changes you suggested. |
ede23a5 to
9440f08
Compare
There was a problem hiding this comment.
Nit, logrus.WithError(err).Error(...)
There was a problem hiding this comment.
Oh, just realized we need a way to cancel this loop
There was a problem hiding this comment.
That's a good point. I added a check to break out of the loop if the Close() method was called on the logger.
|
Sorry for the back and forth here, getting really close. Thanks for updating. |
9440f08 to
830da1d
Compare
There was a problem hiding this comment.
If we do this, then we need some synchronization.
There was a problem hiding this comment.
You mean locking before we read this?
There was a problem hiding this comment.
That or use a channel/some other form of syncronization.
There was a problem hiding this comment.
I added RLock around it.
830da1d to
f262b70
Compare
cpuguy83
left a comment
There was a problem hiding this comment.
One more nit, but otherwise LGTM
There was a problem hiding this comment.
Can you add the container ID to log (.WithField(info.ContainerID)) and some extra details that it is going to try again in backoff seconds? Maybe change the wording Error while trying to initialize awslogs so the log implies that it's not giving up.
There was a problem hiding this comment.
Added container Id, container name as well as making the message Error while trying to initialize awslogs. Retrying in [backoff] seconds".
When then non-blocking mode is specified, awslogs will: - No longer potentially block calls to logstream.Log(), instead will return an error if the awslogs buffer is full. This has the effect of dropping log messages sent to awslogs.Log() that are made while the buffer is full. - Wait to initialize the log stream until the first Log() call instead of in New(). This has the effect of allowing the container to start in the case where Cloudwatch Logs is unreachable. Both of these changes require the --log-opt mode=non-blocking to be explicitly set and do not modify the default behavior. Signed-off-by: Cody Roseborough <crrosebo@amazon.com>
f262b70 to
c7e3799
Compare
|
ping @anusha-ragunathan still LGTY? looks like this is ready to go |
| break | ||
| } | ||
|
|
||
| time.Sleep(time.Duration(backoff) * time.Second) |
There was a problem hiding this comment.
For my clarification: Does the backoff make sense for all types of errors from create? Or is there a way to filter by errors that are known to heal with time? Another way to look at it is, if there are errors that are known to not heal even with backoff, then we should exit right away.
This is an optimization and can be addressed in a follow-up PR, if applicable.
There was a problem hiding this comment.
This is a good question.
The classes of errors I can think of all seem to have the ability to heal over time. The classes I can think of are:
- Network availability
- Service errors
- Service Limits
- Credentials/policy issue
Do you see something outside of these classes or a way that one of these won't be able to possibly heal?
There was a problem hiding this comment.
IMO, Credentials and policy issues dont heal over time. They need manual intervention to fix the issue. Also service errors is a broad category. There's a possibility that there are some types of service errors that dont auto-heal with time.
There was a problem hiding this comment.
Credential and permission issues can be healed out-of-band by changes in the IAM policy. They won't automatically heal, but can be fixed while the container is running.
By "service errors", @IRCody is talking about the kinds of errors that would be fixed by AWS, also out-of-band with respect to the container. Network availability and service errors are the general issue that we're trying to address with this code.
There was a problem hiding this comment.
Agree, all things that can be fixed outside of the scope of the container (e.g. when error logs start showing up).
On a separate note it might be nice (in a another exercise) to track metrics for errors here.
|
LGTM |
|
Thanks @IRCody |


Fixes #33803
Since this has some changes in behavior I wanted to call out some specific scenarios here. All scenarios are talking about behavior when
logopt mode=non-blockingis set. Default behavior should remain unchanged.Scenario 1: Unable to connect to CloudWatch Logs on startup
Current behavior:
Fails to start container after the timeout is reached during the createLogStream call.
After this PR:
Container will start and initialize the log stream on the first call to Log() (and all subsequent calls until the log stream is created successfully). Errors here are not fatal to the container.
Scenario 2: CloudWatch Logs becomes unavailable and the container is stopped.
Current behavior:
The RingLogger Close() will call awslogs.Log() on all remaining messages in the buffer. Each call is potentially blocking on a channel write that ultimately depends on a call to CloudWatch Logs. These will complete but very slowly as it times out multiple calls to CloudWatch Logs.
After this PR:
awslogs.Log() is no longer blocking. If it's unable to add an item to its internal buffer it returns an error. This has the effect of ignoring log messages that come after the awslogs internal buffer is full.