Skip to content

[train] Add training failed error back to failure policy log#59957

Merged
matthewdeng merged 2 commits into
ray-project:masterfrom
TimothySeah:tseah/add-error-message-back
Jan 9, 2026
Merged

[train] Add training failed error back to failure policy log#59957
matthewdeng merged 2 commits into
ray-project:masterfrom
TimothySeah:tseah/add-error-message-back

Conversation

@TimothySeah

Copy link
Copy Markdown
Contributor

Summary

Before this PR, the training failed error was buried in the exc_text part of the log. After this PR it should also appear in the message part of the log.

Testing

Unit tests

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah requested a review from a team as a code owner January 8, 2026 00:55

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request aims to make training failure errors more visible in logs by adding the error message to the main log string, in addition to the traceback provided via exc_info. While this achieves the goal, it introduces redundancy in the log output as the error message will appear twice. I've added a comment with a suggestion to avoid this redundancy while still ensuring the full error details are in the main log message.

Comment thread python/ray/train/v2/_internal/execution/failure_handling/default.py Outdated
@ray-gardener ray-gardener Bot added train Ray Train Related Issue observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling labels Jan 8, 2026
Comment on lines +47 to +48
f" Error count: {error_count} (max allowed: {retry_limit})\n\n"
f"{training_failed_error}",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it intentional for there to be two new lines? I think the extra line might cause a break in the logs that separates the error from the log. Wha do you think about something like this?

Suggested change
f" Error count: {error_count} (max allowed: {retry_limit})\n\n"
f"{training_failed_error}",
f" Error count: {error_count} (max allowed: {retry_limit})\n"
f"Error: {training_failed_error}",

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I added two newlines because that's what it was before (https://github.com/ray-project/ray/pull/58287/files#diff-4161b23c45b953b8c90f938eb49acf72441246ee7ca70d889c1be17d967417caL47), though I'm not sure why this was the case.

…lt.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah added the go add ONLY when ready to merge, run all tests label Jan 9, 2026
@matthewdeng matthewdeng merged commit 314e738 into ray-project:master Jan 9, 2026
5 of 7 checks passed
AYou0207 pushed a commit to AYou0207/ray that referenced this pull request Jan 13, 2026
…ject#59957)

# Summary

Before this PR, the training failed error was buried in the `exc_text`
part of the log. After this PR it should also appear in the `message`
part of the log.

# Testing

Unit tests

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Signed-off-by: jasonwrwang <jasonwrwang@tencent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling train Ray Train Related Issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants