Context
A 3D-guided genomic selection framework that aligns SNP embeddings with LiDAR-derived 3D structure embeddings to learn more informative genotype representations for architecture-related trait prediction.
What I built
- Privileged-information training: 3D point clouds used only during training to shape genotype embeddings; inference remains genotype-only.
- Robust 3D phenotyping pipeline: TreeQSM → cylinder graph reconstruction → branch-order correction → extraction of branch count + mean branch angle.
- Side-by-side benchmarking against strong small-n baselines (RF/GB) and a genotype-only transformer baseline (DPCFormer).
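To make the branch-order correction step more concrete, here is a minimal sketch of one way orders could be assigned on a cylinder graph once it has been reduced to a rooted tree of segments. The function name `assign_branch_orders`, the dict-based inputs, and the "longest reach continues the parent's order" rule are all assumptions for illustration, not the project's actual procedure; an open-vase variant would likely let several primary scaffolds share the lowest order.

```python
from collections import defaultdict


def assign_branch_orders(parent, length, root, trunk_order=0):
    """Assign a branch order to every segment of a rooted tree.

    parent: dict mapping segment id -> parent id (the root maps to None)
    length: dict mapping segment id -> segment length
    Illustrative rule (an assumption): at each fork, the child whose subtree
    reaches the farthest tip continues the parent's order; every other child
    starts a new branch with order + 1.
    """
    children = defaultdict(list)
    for child, par in parent.items():
        if par is not None:
            children[par].append(child)

    # Longest path length from each segment down to a tip (iterative post-order).
    reach = {}
    stack = [(root, False)]
    while stack:
        node, visited = stack.pop()
        if visited:
            best_child = max((reach[c] for c in children[node]), default=0.0)
            reach[node] = length[node] + best_child
        else:
            stack.append((node, True))
            stack.extend((c, False) for c in children[node])

    # Top-down pass: propagate orders from the root outward.
    order = {root: trunk_order}
    stack = [root]
    while stack:
        node = stack.pop()
        kids = children[node]
        if not kids:
            continue
        main = max(kids, key=lambda c: reach[c])
        for c in kids:
            order[c] = order[node] if c == main else order[node] + 1
            stack.append(c)
    return order


# Toy tree: trunk "t" forks into "a" and "b"; "a" forks into "a1" and "a2".
parent = {"t": None, "a": "t", "b": "t", "a1": "a", "a2": "a"}
length = {"t": 1.0, "a": 2.0, "b": 1.5, "a1": 1.0, "a2": 0.5}
print(assign_branch_orders(parent, length, "t"))
# -> {'t': 0, 'a': 0, 'b': 1, 'a1': 0, 'a2': 1}
```

Branch count per order then follows by counting the segments where the order changes relative to the parent.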
Results
- Built a paired genotype↔3D dataset for peach architecture: ~131k SNPs for each of 122 trees, plus LiDAR point clouds.
- Mean branch angle prediction: GenoCLIP achieved MAE 6.8 and PCC 0.644, outperforming genotype-only DPCFormer (MAE 7.4, PCC 0.591) and approaching the strongest classical baseline, RF (MAE 6.7, PCC 0.68).
- Branch count prediction: GenoCLIP lowered MAE relative to DPCFormer (43.9 vs 48.0) at a slightly lower correlation (PCC 0.4408 vs 0.4567); RF remained the best small-n baseline (MAE 40.2, PCC 0.445).
- Improved phenotype reliability for open-vase trees by correcting TreeQSM topology via cylinder-graph reconstruction and branch-order fixing (primary scaffold identification + recursive higher-order assignment).
- Demonstrated a practical way to inject 3D structural signal as privileged supervision while keeping deployment genotype-only.
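For readers less familiar with the two metrics reported above, here is a minimal sketch of how they can be computed, assuming PCC denotes the Pearson correlation coefficient between predicted and measured trait values. The function names and the toy numbers are illustrative only and are not taken from the project.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error between measured and predicted trait values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pcc(y_true, y_pred):
    """Pearson correlation coefficient between measured and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


# Toy values (not project data): measured vs predicted trait values.
measured = [40.0, 50.0, 60.0, 70.0]
predicted = [42.0, 49.0, 63.0, 66.0]
print(mae(measured, predicted))  # -> 2.5
print(pcc(measured, predicted))
```

MAE is in the units of the trait, so lower is better; PCC is unitless, so the two can disagree about which model ranks first, as in the branch count results.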
Problem
Perennial crop breeding is slow: architecture phenotyping is expensive and time-consuming, while genotype-only deep models can overfit in small-n, ultra-high-dimensional SNP settings and ignore rich 3D structural signal during representation learning.
Approach
- SNP encoding: encoded biallelic SNP calls as allele dosage (0/1/2) with an explicit missing-value token (3).
- 3D reconstruction: reconstructed tree structure from point clouds as TreeQSM cylinders, then corrected topology for open-vase trees by building a cylinder graph and fixing branch orders (identify the longest trunk-to-tip scaffolds, then assign higher orders).
- Baselines: trained classical models (Linear/Ridge/ElasticNet/RF/GB/XGBoost) and a genotype-only deep baseline (DPCFormer).
- GenoCLIP: dual encoders (DPCFormer for SNPs, PTv3 for point clouds) trained with symmetric InfoNCE to align paired genotype↔3D embeddings; a regression head attached to the genotype encoder then predicts traits from SNPs only at inference.
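The SNP encoding and the contrastive objective described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the project's training code: genotype calls are assumed to arrive as VCF-style `GT` strings, the temperature of 0.07 is an assumed default, embeddings are plain lists of floats rather than outputs of DPCFormer or PTv3, and the names `encode_genotype` and `symmetric_info_nce` are hypothetical.

```python
import math

MISSING_TOKEN = 3  # explicit token for missing calls, as described in the text


def encode_genotype(gt):
    """Map a biallelic GT string (assumed VCF-style) to dosage 0/1/2, or 3 if missing."""
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or any(a not in ("0", "1") for a in alleles):
        return MISSING_TOKEN
    return int(alleles[0]) + int(alleles[1])


def _normalize(v):
    """L2-normalize a vector (assumes it is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _diag_cross_entropy(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_info_nce(geno_emb, cloud_emb, temperature=0.07):
    """Average of genotype->3D and 3D->genotype InfoNCE over a paired batch."""
    g = [_normalize(v) for v in geno_emb]
    c = [_normalize(v) for v in cloud_emb]
    # Cosine similarity of every genotype with every point cloud, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(gi, cj)) / temperature for cj in c] for gi in g]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(logits_t))


print([encode_genotype(gt) for gt in ["0/0", "0/1", "1|1", "./."]])  # -> [0, 1, 2, 3]

geno = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]      # tree i's point cloud matches genotype i
mismatched = [[0.0, 1.0], [1.0, 0.0]]   # pairs swapped
print(symmetric_info_nce(geno, aligned) < symmetric_info_nce(geno, mismatched))  # -> True
```

The loss is low when each genotype embedding is closest to its own tree's point-cloud embedding and high otherwise, which is what pulls 3D structure into the genotype encoder during training. Only the genotype side is needed afterwards, so the point-cloud encoder can be dropped at inference.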
What I learned
Across 122 trees (~131k SNPs each), GenoCLIP improved on the genotype-only deep baseline, most clearly on the geometry-sensitive mean branch angle trait (MAE 6.8, PCC 0.644 vs DPCFormer MAE 7.4, PCC 0.591), while remaining close to the strongest classical baseline (RF MAE 6.7, PCC 0.68). On branch count, GenoCLIP reduced MAE vs DPCFormer (43.9 vs 48.0) at a slightly lower PCC (0.4408 vs 0.4567).
Links