DinoDepth
A large, harmonized, pre-shuffled corpus of 358,905 (image, depth) pairs for training monocular affine-invariant depth models. Five complementary sources — indoor, driving, robotics, aerial, and multi-view-stereo — are decoded to a single planar-depth convention, packed into uniform ~1 GB Parquet shards, and globally shuffled so any shard is a representative sample of the whole.
Schema
| column | type | description |
|---|---|---|
| dataset | string | source dataset tag |
| height, width | int32 | native resolution of the stored arrays |
| image | binary | JPEG bytes (RGB) |
| depth | binary | 16-bit PNG; depth = depth_png * depth_scale, 0 = invalid |
| depth_scale | float32 | metres per PNG level (arbitrary per-scene units for multi-view stereo) |
```python
import io

import numpy as np
from PIL import Image
from datasets import load_dataset

# Stream the training split and take the first row.
ds = load_dataset("blanchon/dinodepth-dataset", split="train", streaming=True)
row = next(iter(ds))
rgb = Image.open(io.BytesIO(row["image"]))  # RGB
depth = np.asarray(Image.open(io.BytesIO(row["depth"]))) * row["depth_scale"]  # m; 0 = invalid
```
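Because level 0 marks an invalid pixel, it is usually convenient to keep a validity mask next to the decoded depth. The helper below is a minimal sketch of that idea; `decode_depth` is a hypothetical name, not part of the dataset or of any library, and the tiny 2x2 PNG is synthetic data standing in for `row["depth"]`.

```python
import io

import numpy as np
from PIL import Image


def decode_depth(depth_png: bytes, depth_scale: float):
    """Return (depth, valid): float32 depth and a boolean mask of valid pixels."""
    levels = np.asarray(Image.open(io.BytesIO(depth_png)))
    valid = levels > 0  # level 0 marks an invalid pixel
    depth = levels.astype(np.float32) * np.float32(depth_scale)
    return depth, valid


# Synthetic stand-in for row["depth"]: a 2x2 16-bit PNG with one invalid pixel.
levels = np.array([[0, 1000], [2000, 65535]], dtype=np.uint16)
buf = io.BytesIO()
Image.fromarray(levels).save(buf, format="PNG")

depth, valid = decode_depth(buf.getvalue(), 0.001)
```

Losses and metrics can then be restricted to `depth[valid]` so that invalid pixels never contribute.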
Composition
| source | samples | domain | depth |
|---|---|---|---|
| TartanAir | 186,693 | robotics / aerial (synthetic) | metric |
| BlendedMVS | 74,838 | multi-view stereo (real images) | non-metric |
| IRS | 57,819 | indoor (synthetic) | metric |
| Hypersim | 26,912 | indoor (synthetic) | metric |
| Virtual KITTI 2 | 12,643 | driving (synthetic) | metric |
| total | 358,905 | | |
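Since the corpus is globally shuffled, the source mix in any reasonably large sample of rows should sit close to the proportions in the table. The sketch below computes those expected fractions and compares them with what a sample contains. The function names are hypothetical, and the lowercase tag strings are an assumption about how the `dataset` column names each source.

```python
from collections import Counter
from itertools import islice

# Sample counts copied from the composition table above.
# The tag spellings are assumed, not confirmed by the card.
EXPECTED = {
    "tartanair": 186_693,
    "blendedmvs": 74_838,
    "irs": 57_819,
    "hypersim": 26_912,
    "vkitti": 12_643,
}


def expected_fractions(counts=EXPECTED):
    """Share of the whole corpus contributed by each source."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def observed_fractions(rows, limit):
    """Share of each `dataset` tag among the first `limit` rows of an iterable."""
    tags = Counter(row["dataset"] for row in islice(rows, limit))
    seen = sum(tags.values())
    return {name: n / seen for name, n in tags.items()}
```

Passing a streamed split as `rows` with a `limit` of a few thousand should give fractions near `expected_fractions()`; a large gap would suggest the tags or the shuffle differ from what is assumed here.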
Harmonization
Every source is decoded to planar depth Z, with invalid pixels (sky, far-plane saturation, unreconstructed, non-finite) set to 0:
- Hypersim: ray distance → planar Z (per-pixel, focal 886.81 px).
- Virtual KITTI 2: native depth / 100 (cm → m).
- IRS: disparity → depth (48 / disparity).
- TartanAir: native metres; sky masked.
- BlendedMVS: per-scene multi-view-stereo depth (arbitrary, non-metric scale).
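The per-source conversions listed above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the pipeline that built the dataset: the function names are hypothetical, the Hypersim conversion assumes a pinhole camera with the principal point at the image centre, and the constant 48 for IRS is taken from the list as-is.

```python
import numpy as np


def hypersim_distance_to_planar(distance, focal=886.81):
    """Ray distance from the camera centre -> planar depth Z (pinhole model)."""
    h, w = distance.shape
    # Pixel-centre offsets from an assumed principal point at the image centre.
    x = np.arange(w, dtype=np.float64) - (w - 1) / 2.0
    y = np.arange(h, dtype=np.float64) - (h - 1) / 2.0
    xx, yy = np.meshgrid(x, y)
    ray_norm = np.sqrt(xx**2 + yy**2 + focal**2)
    return distance * focal / ray_norm


def vkitti_to_metres(depth_cm):
    """Virtual KITTI 2 stores centimetres; divide by 100 to get metres."""
    return np.asarray(depth_cm, dtype=np.float64) / 100.0


def irs_disparity_to_depth(disparity, constant=48.0):
    """depth = constant / disparity; non-positive disparity stays 0 (invalid)."""
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.zeros_like(disparity)
    np.divide(constant, disparity, out=depth, where=disparity > 0)
    return depth


def zero_invalid(depth):
    """Set non-finite or non-positive pixels to 0, the invalid marker."""
    depth = np.asarray(depth, dtype=np.float64)
    return np.where(np.isfinite(depth) & (depth > 0), depth, 0.0)
```

For a fronto-parallel wall, ray distance grows toward the image corners while planar Z stays constant, which is why the Hypersim correction is applied per pixel.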
Depth is stored at native resolution as 16-bit PNG with a per-image depth_scale. Metric sources
are clamped to a 100 m far plane; the multi-view-stereo source keeps its per-scene scale — a
scale-and-shift-invariant objective absorbs it, so the sources mix directly.
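To make the storage format concrete, here is one plausible way to quantise a planar depth map into a 16-bit PNG with a per-image `depth_scale`. It is a sketch only: `encode_depth` is a hypothetical helper, and mapping the largest valid depth to level 65535 is an assumption, since the card does not say how `depth_scale` was chosen. The 100 m far-plane handling is left out.

```python
import io

import numpy as np
from PIL import Image

MAX_LEVEL = 65535  # largest value a 16-bit PNG can hold


def encode_depth(depth):
    """Quantise planar depth (0 = invalid) to 16-bit PNG bytes plus a depth_scale."""
    depth = np.asarray(depth, dtype=np.float64)
    valid = depth > 0
    # Assumption: the largest valid depth in the image maps to MAX_LEVEL.
    max_depth = float(depth[valid].max()) if valid.any() else 1.0
    depth_scale = max_depth / MAX_LEVEL
    levels = np.zeros(depth.shape, dtype=np.uint16)
    # Keep valid pixels at level >= 1 so they are not mistaken for invalid ones.
    quantised = np.clip(np.rint(depth[valid] / depth_scale), 1, MAX_LEVEL)
    levels[valid] = quantised.astype(np.uint16)
    buf = io.BytesIO()
    Image.fromarray(levels).save(buf, format="PNG")
    return buf.getvalue(), np.float32(depth_scale)


# Round trip on a small synthetic depth map (metres, 0 = invalid).
original = np.array([[0.0, 1.5], [20.0, 80.0]])
png_bytes, scale = encode_depth(original)
decoded = np.asarray(Image.open(io.BytesIO(png_bytes))) * scale
```

With this choice the quantisation step equals `depth_scale`, so images with a short depth range are stored more finely than images that reach the far plane.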
Splits
train: the full 358,905-image corpus, globally shuffled into 141 shards (~1 GB each, with small row groups and a page index for fast random access).
Evaluation (held-out)
Zero-shot benchmarks, kept out of training and shipped as separate configs. Evaluate with per-image affine (least-squares scale + shift) alignment of predicted disparity to ground truth, then report AbsRel and δ1:
| config | benchmark | notes |
|---|---|---|
| nyuv2 | NYUv2 | indoor RGB-D; Eigen 654 test split; metric depth, capped at 10 m |
| kitti | KITTI | driving; Eigen test split; sparse LiDAR depth, capped at 80 m |
```python
from datasets import load_dataset

ev = load_dataset("blanchon/dinodepth-dataset", name="nyuv2", split="test")
```
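The evaluation protocol described above (per-image least-squares scale and shift in disparity space, then AbsRel and δ1) can be sketched as below. The function name `align_and_score` is hypothetical. Two details are assumptions not stated in the card: the aligned disparity is clamped at `1 / max_depth` to keep the recovered depth positive and within the cap, and δ1 uses the conventional 1.25 threshold.

```python
import numpy as np


def align_and_score(pred_disp, gt_depth, max_depth):
    """Affine-align predicted disparity to GT disparity, then return (AbsRel, delta1)."""
    pred_disp = np.asarray(pred_disp, dtype=np.float64)
    gt_depth = np.asarray(gt_depth, dtype=np.float64)

    # Score only pixels with valid ground truth inside the benchmark cap.
    valid = (gt_depth > 0) & (gt_depth <= max_depth)
    gt = gt_depth[valid]
    gt_disp = 1.0 / gt
    p = pred_disp[valid]

    # Least-squares solve for scale s and shift t in  s * p + t ~= gt_disp.
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, gt_disp, rcond=None)

    # Assumption: clamp so that the recovered depth stays in (0, max_depth].
    aligned_disp = np.maximum(s * p + t, 1.0 / max_depth)
    pred_depth = 1.0 / aligned_disp

    abs_rel = np.mean(np.abs(pred_depth - gt) / gt)
    ratio = np.maximum(pred_depth / gt, gt / pred_depth)
    delta1 = np.mean(ratio < 1.25)
    return float(abs_rel), float(delta1)
```

Under these assumptions one would call it with `max_depth=10.0` for the `nyuv2` config and `max_depth=80.0` for `kitti`, then average the per-image scores over the test split.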
License
Released for non-commercial research under CC BY-NC-SA 4.0. Each source retains its original license (Virtual KITTI 2: CC BY-NC-SA 3.0; Hypersim: CC BY-SA 3.0; IRS / BlendedMVS / TartanAir per their upstream terms) — respect the original terms of each source.
Data composition and recipe follow AnyDepth (arXiv:2601.02760).
