-
Notifications
You must be signed in to change notification settings - Fork 18.8k
[27.x]: Make log reading more robust to errors #48842
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Times cannot be compared with `==` and instead should use the `t.Equal` function. Signed-off-by: Brian Goff <cpuguy83@gmail.com> (cherry picked from commit b37c8a0) Signed-off-by: Brian Goff <cpuguy83@gmail.com>
This simplifies how we manage log files, especially rotated ones. It also fixes a long-standing issue to lazily open rotated files so we don't needlessly start decompressing files that we don't need. Much of this is just setting things up for commits following this one. It uses ReaderAtSize for managing all files to be tailed and manages cleanups by passing closures. Signed-off-by: Brian Goff <cpuguy83@gmail.com> (cherry picked from commit 77f2d90) Signed-off-by: Brian Goff <cpuguy83@gmail.com>
When there is an error in parsing an individual log file just close the log and move on to the next one instead of erroring our the entire request. I investigated trying to error correct and scan ahead for corrupted log files but found this is too much of a risk of parsing things we shouldn't be and hence why this is just dropping the rest of the file. Signed-off-by: Brian Goff <cpuguy83@gmail.com> (cherry picked from commit 1b46faf) Signed-off-by: Brian Goff <cpuguy83@gmail.com>
This plumbs a context down the stack and handles cancellation as needed so that we can have correlated traces from the API. Signed-off-by: Brian Goff <cpuguy83@gmail.com> (cherry picked from commit dbf6873) Signed-off-by: Brian Goff <cpuguy83@gmail.com>
This allows for an individual decode operation to be cancelled while the log reader is reading data from a log file by closing the underlying file. Signed-off-by: Brian Goff <cpuguy83@gmail.com> (cherry picked from commit 9b6ba18) Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Co-authored-by:: Cory Snider <csnider@mirantis.com> Signed-off-by: Brian Goff <cpuguy83@gmail.com> (cherry picked from commit 6d94122) Signed-off-by: Brian Goff <cpuguy83@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
but could use a second pair of eyes
| ReadLogs(ReadConfig) *LogWatcher | ||
| ReadLogs(context.Context, ReadConfig) *LogWatcher |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Double checked if this interface change would affect logging plugins, but those implement a separate interface;
Lines 21 to 27 in da3cc1c
| // logPlugin defines the available functions that logging plugins must implement. | |
| type logPlugin interface { | |
| StartLogging(streamPath string, info Info) (err error) | |
| StopLogging(streamPath string) (err error) | |
| Capabilities() (cap Capability, err error) | |
| ReadLogs(info Info, config ReadConfig) (stream io.ReadCloser, err error) | |
| } |
And there's an adaptor for those; see further up the diff 😄

Formed in 2009, the Archive Team (not to be confused with the archive.org Archive-It Team) is a rogue archivist collective dedicated to saving copies of rapidly dying or deleted websites for the sake of history and digital heritage. The group is 100% composed of volunteers and interested parties, and has expanded into a large amount of related projects for saving online and digital history.

Backport of #47983
The only one commit that needed some (minor) touchup was the journald changes.
Fundamentally this change (beyond the minor refactor and tracing spans) is to not treat errors reading from log files as fatal to providing logs to the client.
The need for this change is because often log files may be corrupted either due to old bugs in dockerd or for other reasons.
One specific thing we have seen seems to be caused by ext4 by default is not doing full journaling. On power loss or kernel panic this can cause the log data to be corrupted.
We see this mostly on IoT devices which are often not on battery backup or frequently hard cut from power.
Instead of erroring out a stream on errors from a particular file this just skips on to the next file.
I had made a previous attempt to be a little more clever and scrub the file to try and find a place where we can start parsing again but realized this can cause serious issues and is generally error prone.
This also takes care of a long-standing todo to handle log files lazily (or at least as lazily as possible) so we only end up decompressing files just-in-time.
Unfortunately we don't have a way to return warnings to the client that we had to skip files, however with the tracing changes at least it could be seen there and in the daemon logs.