Aquileo | blanchon/dinodepth-dataset · Datasets at Hugging Face

DinoDepth

A large, harmonized, pre-shuffled corpus of 358,905 (image, depth) pairs for training monocular affine-invariant depth models. Five complementary sources — indoor, driving, robotics, aerial, and multi-view-stereo — are decoded to a single planar-depth convention, packed into uniform ~1 GB Parquet shards, and globally shuffled so any shard is a representative sample of the whole.

Schema

| column | type | description |
| --- | --- | --- |
| dataset | string | source dataset tag |
| height, width | int32 | native resolution of the stored arrays |
| image | binary | JPEG bytes (RGB) |
| depth | binary | 16-bit PNG; depth = depth_png * depth_scale, 0 = invalid |
| depth_scale | float32 | metres per PNG level (arbitrary per-scene units for multi-view stereo) |

import io

import numpy as np
from PIL import Image
from datasets import load_dataset

# Stream the training split so no shard has to be downloaded in full.
ds = load_dataset("blanchon/dinodepth-dataset", split="train", streaming=True)
row = next(iter(ds))
rgb = Image.open(io.BytesIO(row["image"]))                    # RGB JPEG
depth_png = np.asarray(Image.open(io.BytesIO(row["depth"])))  # 16-bit PNG levels
depth = depth_png * row["depth_scale"]                        # metres for metric sources; 0 = invalid
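A decoded row usually needs a validity mask as well, and training code may want to know whether the depth is metric. The helper below is a minimal sketch rather than part of the dataset's tooling: `decode_row` and `NON_METRIC_SOURCES` are hypothetical names, and it assumes the `image` and `depth` columns arrive as raw bytes, as in the loading snippet above, and that the BlendedMVS tag is spelled `blendedmvs` in the `dataset` column.

```python
import io

import numpy as np
from PIL import Image

# Assumption: BlendedMVS rows carry the tag "blendedmvs" in the `dataset` column.
NON_METRIC_SOURCES = {"blendedmvs"}


def decode_row(row):
    """Decode one row into (rgb, depth, valid, is_metric).

    rgb:       uint8 array, H x W x 3
    depth:     float32 array, H x W (depth_png * depth_scale)
    valid:     bool array, H x W (False where the PNG level is 0)
    is_metric: False for sources whose scale is arbitrary per scene
    """
    rgb = np.asarray(Image.open(io.BytesIO(row["image"])).convert("RGB"))
    levels = np.asarray(Image.open(io.BytesIO(row["depth"])))
    valid = levels > 0  # level 0 marks invalid pixels
    depth = levels.astype(np.float32) * np.float32(row["depth_scale"])
    is_metric = row["dataset"] not in NON_METRIC_SOURCES
    return rgb, depth, valid, is_metric
```

Keeping the mask separate from the depth array lets a loss function skip invalid pixels instead of treating them as zero depth.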

Composition

| source | samples | domain | depth |
| --- | --- | --- | --- |
| TartanAir | 186,693 | robotics / aerial (synthetic) | metric |
| BlendedMVS | 74,838 | multi-view stereo (real images) | non-metric |
| IRS | 57,819 | indoor (synthetic) | metric |
| Hypersim | 26,912 | indoor (synthetic) | metric |
| Virtual KITTI 2 | 12,643 | driving (synthetic) | metric |
| total | 358,905 | | |

Harmonization

Every source is decoded to planar depth Z, with invalid pixels (sky, far-plane saturation, unreconstructed, non-finite) set to 0:

  • Hypersim — ray distance → planar Z (per-pixel, focal 886.81 px).
  • Virtual KITTI 2 — native depth / 100 (cm → m).
  • IRS — disparity → depth (48 / disparity).
  • TartanAir — native metres; sky masked.
  • BlendedMVS — per-scene multi-view-stereo depth (arbitrary, non-metric scale).
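The three arithmetic conversions in the list above can be sketched in NumPy. This is an illustration under stated assumptions, not the pipeline that built the dataset: the function names are hypothetical, the Hypersim principal point is assumed to sit at the image centre with offsets measured from pixel centres, and the constants 886.81, 100, and 48 are the ones quoted in the list.

```python
import numpy as np


def hypersim_distance_to_planar(dist, focal=886.81):
    """Convert per-pixel ray distance to planar depth Z.

    Assumes a pinhole camera with the principal point at the image centre.
    """
    h, w = dist.shape
    # Pixel-centre offsets from the assumed principal point.
    u = np.arange(w, dtype=np.float64) - (w - 1) / 2.0
    v = np.arange(h, dtype=np.float64) - (h - 1) / 2.0
    uu, vv = np.meshgrid(u, v)
    # Length of the ray through each pixel when it reaches the plane Z = focal.
    ray_len = np.sqrt(uu**2 + vv**2 + focal**2)
    return dist * focal / ray_len


def vkitti_to_metres(depth_cm):
    """Virtual KITTI 2 stores depth in centimetres."""
    return np.asarray(depth_cm, dtype=np.float64) / 100.0


def irs_disparity_to_depth(disparity, factor=48.0):
    """depth = factor / disparity; non-positive or non-finite disparity maps to 0."""
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.zeros_like(disparity)
    ok = np.isfinite(disparity) & (disparity > 0)
    depth[ok] = factor / disparity[ok]
    return depth
```

For a fronto-parallel plane at Z = 2, the ray distance grows towards the image corners, and `hypersim_distance_to_planar` maps it back to a constant 2 everywhere.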

Depth is stored at native resolution as 16-bit PNG with a per-image depth_scale. Metric sources are clamped to a 100 m far plane; the multi-view-stereo source keeps its per-scene scale — a scale-and-shift-invariant objective absorbs it, so the sources mix directly.
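To make the storage convention concrete, here is one way a float depth map could be packed into 16 bits with a per-image scale. The card does not state how `depth_scale` was chosen, so this sketch assumes the scale maps the largest valid depth to PNG level 65535. `encode_depth` and `decode_depth` are hypothetical names.

```python
import numpy as np


def encode_depth(depth, max_level=65535):
    """Quantise a float depth map to uint16 levels plus a per-image scale.

    Assumption: scale = max valid depth / max_level. Level 0 is reserved
    for invalid pixels (non-finite or non-positive depth).
    """
    depth = np.asarray(depth, dtype=np.float64)
    depth = np.where(np.isfinite(depth) & (depth > 0), depth, 0.0)
    peak = float(depth.max())
    if peak == 0.0:
        # No valid pixel: every level is 0 and the scale carries no information.
        return np.zeros(depth.shape, np.uint16), 0.0
    scale = peak / max_level
    levels = np.clip(np.rint(depth / scale), 0, max_level).astype(np.uint16)
    # Keep very small valid depths off level 0 so they are not read as invalid.
    levels[(depth > 0) & (levels == 0)] = 1
    return levels, scale


def decode_depth(levels, scale):
    """Inverse mapping, as in the schema: depth = depth_png * depth_scale."""
    return levels.astype(np.float32) * np.float32(scale)
```

Under this assumption, a map that reaches the 100 m far plane gets a scale of about 0.0015 m per level, so the rounding error stays below a millimetre.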

Splits

  • train — the full 358,905-image corpus, globally shuffled into 141 shards (~1 GB each, small row groups + page index for fast random access).

Evaluation (held-out)

Zero-shot benchmarks, kept out of training and shipped as separate configs. Evaluate with per-image affine (least-squares scale + shift) alignment of predicted disparity to ground truth, then report AbsRel and δ1:

| config | benchmark | notes |
| --- | --- | --- |
| nyuv2 | NYUv2 | indoor RGB-D; Eigen 654 test split; metric depth, capped at 10 m |
| kitti | KITTI | driving; Eigen test split; sparse LiDAR depth, capped at 80 m |

ev = load_dataset("blanchon/dinodepth-dataset", name="nyuv2", split="test")
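The protocol above can be sketched as follows. The card does not spell out every detail, so this version makes assumptions: the prediction is aligned to ground-truth disparity (1 / depth) over valid pixels, the aligned disparity is inverted back to depth, and AbsRel and δ1 (threshold 1.25) are computed in depth space. `align_scale_shift` and `evaluate` are hypothetical names.

```python
import numpy as np


def align_scale_shift(pred, target, mask):
    """Least-squares s, t minimising ||s * pred + t - target||^2 over mask."""
    x = pred[mask].astype(np.float64)
    y = target[mask].astype(np.float64)
    design = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(design, y, rcond=None)
    return s, t


def evaluate(pred_disp, gt_depth, max_depth=10.0):
    """Return (AbsRel, delta1) for one image after affine alignment.

    max_depth is the benchmark cap: 10 m for NYUv2, 80 m for KITTI.
    """
    gt_depth = np.asarray(gt_depth, dtype=np.float64)
    mask = (gt_depth > 0) & (gt_depth <= max_depth)
    gt_disp = np.zeros_like(gt_depth)
    gt_disp[mask] = 1.0 / gt_depth[mask]

    s, t = align_scale_shift(np.asarray(pred_disp), gt_disp, mask)
    # Floor the aligned disparity so the inversion never exceeds the depth cap.
    aligned = np.clip(s * pred_disp + t, 1.0 / max_depth, None)
    pred_depth = 1.0 / aligned

    p, g = pred_depth[mask], gt_depth[mask]
    abs_rel = float(np.mean(np.abs(p - g) / g))
    delta1 = float(np.mean(np.maximum(p / g, g / p) < 1.25))
    return abs_rel, delta1
```

A prediction that differs from the true disparity only by a scale and a shift scores AbsRel ≈ 0 and δ1 = 1, which is a quick sanity check before running a full benchmark.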

License

Released for non-commercial research under CC BY-NC-SA 4.0. Each source retains its original license (Virtual KITTI 2: CC BY-NC-SA 3.0; Hypersim: CC BY-SA 3.0; IRS / BlendedMVS / TartanAir per their upstream terms) — respect the original terms of each source.


Data composition and recipe follow AnyDepth (arXiv:2601.02760).
