[train] Add training failed error back to failure policy log #59957
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Code Review
This pull request aims to make training failure errors more visible in logs by adding the error message to the main log string, in addition to the traceback provided via exc_info. While this achieves the goal, it introduces redundancy in the log output as the error message will appear twice. I've added a comment with a suggestion to avoid this redundancy while still ensuring the full error details are in the main log message.
f" Error count: {error_count} (max allowed: {retry_limit})\n\n"
f"{training_failed_error}",
Is it intentional for there to be two newlines? I think the extra line might cause a break in the logs that separates the error from the rest of the log message. What do you think about something like this?
- f" Error count: {error_count} (max allowed: {retry_limit})\n\n"
- f"{training_failed_error}",
+ f" Error count: {error_count} (max allowed: {retry_limit})\n"
+ f"Error: {training_failed_error}",
Sure. I added two newlines because that's what it was before (https://github.com/ray-project/ray/pull/58287/files#diff-4161b23c45b953b8c90f938eb49acf72441246ee7ca70d889c1be17d967417caL47), though I'm not sure why this was the case.
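For reference, here is a minimal sketch of how the two variants discussed in this thread differ once the message is split into lines. The helper `build_message` and the sample values are hypothetical, made up for illustration; only the two f-string layouts come from the diff above.

```python
def build_message(error_count, retry_limit, error, blank_line):
    """Build the tail of the log message in either layout (hypothetical helper)."""
    # Original layout: two newlines, no label. Suggested layout: one newline, "Error: " label.
    separator = "\n\n" if blank_line else "\n"
    label = "" if blank_line else "Error: "
    return (
        f" Error count: {error_count} (max allowed: {retry_limit})"
        f"{separator}{label}{error}"
    )


# Sample values are assumptions, not taken from real Ray logs.
original = build_message(1, 3, "worker 0 crashed", blank_line=True)
suggested = build_message(1, 3, "worker 0 crashed", blank_line=False)

# The original layout contains an empty line between the count and the error.
print(original.split("\n"))
print(suggested.split("\n"))
```

With the double newline, the split yields an empty string between the error count and the error text, which is the break in the logs raised in the review comment.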
…lt.py Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Signed-off-by: Timothy Seah <tseah@anyscale.com>
…ject#59957) # Summary Before this PR, the training failed error was buried in the `exc_text` part of the log. After this PR it should also appear in the `message` part of the log. # Testing Unit tests --------- Signed-off-by: Timothy Seah <tseah@anyscale.com> Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Signed-off-by: jasonwrwang <jasonwrwang@tencent.com>
Summary
Before this PR, the training failed error was buried in the `exc_text` part of the log. After this PR it should also appear in the `message` part of the log.
Testing
Unit tests
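To illustrate the distinction between `message` and `exc_text`, here is a small sketch using only the standard library `logging` module. It is not the Ray Train code: `TrainingFailedError` here is a hypothetical stand-in class, and the logger name, handler, and sample values are assumptions for the demo. It shows that passing `exc_info` alone places the error only in `exc_text`, whereas interpolating the error into the log string also places it in `message`.

```python
import logging


class TrainingFailedError(Exception):
    """Hypothetical stand-in for the real training failure error."""


class CapturingHandler(logging.Handler):
    """Keeps formatted records in memory so their fields can be inspected."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        # Formatting populates record.message and, when exc_info is set, record.exc_text.
        self.format(record)
        self.records.append(record)


logger = logging.getLogger("failure_policy_demo")  # assumed demo logger name
logger.setLevel(logging.INFO)
logger.propagate = False
handler = CapturingHandler()
logger.addHandler(handler)

error_count, retry_limit = 1, 3  # sample values, not real output
try:
    raise TrainingFailedError("worker 0 crashed")
except TrainingFailedError as training_failed_error:
    # Error included in the log string and also passed as exc_info.
    logger.error(
        f" Error count: {error_count} (max allowed: {retry_limit})\n"
        f"Error: {training_failed_error}",
        exc_info=training_failed_error,
    )
    # Error passed only as exc_info, so it is absent from the message.
    logger.error(
        f" Error count: {error_count} (max allowed: {retry_limit})",
        exc_info=training_failed_error,
    )

with_error, without_error = handler.records
print("worker 0 crashed" in with_error.message)
print("worker 0 crashed" in without_error.message)
print("worker 0 crashed" in without_error.exc_text)
```

In this sketch the first record carries the error text in both fields, which is the redundancy noted in the code review, while the second carries it only in `exc_text`.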