[Core] Support TPU v7x accelerator type for device discovery #60338
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Code Review
This pull request adds support for Google Cloud's seventh-generation TPU (Ironwood), which uses a new naming convention (e.g., tpu7x-16). The changes correctly update the regex for accelerator type validation and the logic that converts the new tpu-prefixed types to the internal v-prefixed format. The new test cases for tpu7x-16 are a welcome addition and confirm that the new functionality works as expected. However, there is a critical oversight: the VALID_TPU_TYPES tuple needs to be updated to fully support the new v7x generation.
…urce Signed-off-by: ryanaoleary <ryanaoleary@google.com>
In this PR we convert the
    version = accel_type_lower[4:]
    version = accel_type_lower.replace("tpu-", "")
    elif accel_type_lower.startswith("tpu"):
        version = accel_type_lower.replace("tpu", "v")
Missing tpu7x conversion in SlicePlacementGroup validation
Medium Severity
The PR description states users can specify tpu7x, tpu-v7x, or v7x as accelerator_version when creating a SlicePlacementGroup. However, _accelerator_version_check directly checks membership in VALID_TPU_TYPES without converting tpu7x to v7x first. While get_tpu_worker_resources (called earlier in __init__) uses get_tpu_version_from_type to convert tpu7x → v7x, the subsequent validation in _accelerator_version_check fails because it doesn't perform the same conversion. Users passing accelerator_version="tpu7x" will get a ValueError despite the documented support.
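One way to close the gap described in this comment is to normalize the user-supplied version before the membership check. The sketch below is illustrative only: `normalize_tpu_version`, `check_accelerator_version`, and the contents of `VALID_TPU_TYPES` are stand-ins chosen for this example, not the code in this PR.

```python
# Illustrative stand-in; the real VALID_TPU_TYPES tuple may contain other entries.
VALID_TPU_TYPES = ("v5p", "v6e", "v7x")


def normalize_tpu_version(accelerator_version: str) -> str:
    """Map 'tpu-v7x', 'tpu7x', or 'v7x' to the internal 'v7x' form (hypothetical helper)."""
    lowered = accelerator_version.strip().lower()
    if lowered.startswith("tpu-"):
        # 'tpu-v7x' -> 'v7x'
        return lowered[len("tpu-"):]
    if lowered.startswith("tpu"):
        # 'tpu7x' -> 'v7x'
        return "v" + lowered[len("tpu"):]
    return lowered


def check_accelerator_version(accelerator_version: str) -> str:
    """Validate after normalizing, so every documented spelling is accepted."""
    version = normalize_tpu_version(accelerator_version)
    if version not in VALID_TPU_TYPES:
        raise ValueError(
            f"Invalid accelerator version {accelerator_version!r}; "
            f"expected one of {VALID_TPU_TYPES}"
        )
    return version
```

With this ordering, `check_accelerator_version("tpu7x")` passes validation instead of raising, because the membership test sees the converted value rather than the raw input.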
Description
The TPU 7x generation introduces a change in the accelerator type naming convention reported by the environment. Unlike previous generations (v6e-16, v5p-8, etc.), 7x instances report types starting with tpu (e.g. tpu7x-16).
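To make the two naming conventions concrete, here is a minimal sketch of a parser that accepts both. The pattern and the `parse_accelerator_type` function are assumptions made for illustration; they are not the regex changed in this PR.

```python
import re
from typing import Tuple

# Hypothetical pattern: a 'v' prefix (older style) or a 'tpu' prefix (7x style),
# then the generation (digits plus optional letters), a hyphen, and a numeric suffix.
_ACCEL_TYPE_RE = re.compile(r"^(?:v|tpu)(\d+[a-z]*)-(\d+)$")


def parse_accelerator_type(accelerator_type: str) -> Tuple[str, int]:
    """Return (internal version, numeric suffix) for either naming style."""
    match = _ACCEL_TYPE_RE.match(accelerator_type.strip().lower())
    if match is None:
        raise ValueError(f"Unrecognized accelerator type: {accelerator_type!r}")
    generation, suffix = match.groups()
    # Both styles are reported in the internal v-prefixed form.
    return f"v{generation}", int(suffix)
```

Under these assumptions, `parse_accelerator_type("v6e-16")` and `parse_accelerator_type("tpu7x-16")` both yield a v-prefixed version, so downstream code only has to handle one format.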
This PR accounts for the new format and enables Ray to auto-detect v7x hardware, so users don't have to configure environment variables manually. This is critical for libraries like Ray Train and for vLLM support, where automatic device discovery is used during JAX initialization.
Related issues
Fixes #59964
Additional information
For more info about TPU v7x: https://docs.cloud.google.com/tpu/docs/tpu7x.