Conv2d(1, 64, kernel=16, stride=16) extracts and projects each patch in one step, followed by LayerNorm(64). The norm operates in embed_dim space (128 params fixed) rather than patch space, making it cheaper than a flat linear embedding. Total: 16,576 params.
g_h = sigmoid(λ_h) per head. g→1: head relies on the GICS sector prior. g→0: head is fully content-driven. Initialised at λ=0 (g=0.5) so both streams contribute equally from the first epoch.
Row-normalised sector-pair membership matrix. Each patch at grid (r,c) is assigned group frozenset({row_sector, col_sector}), so (r,c) and (c,r) share the same group, preserving distance matrix symmetry. Directly encodes the block-diagonal structure of the GICS-reordered distance matrix.
Per-channel scalar γ on each residual branch, initialised at 1e-2. Provides meaningful gradient signal through the residual branches from the first epoch while stabilising early training on small datasets.
Fixed rate of 0.05 (no linear schedule with only 1 block). Light regularisation — drops the entire block at training time.
457 is not divisible by 16. Reflect-pad to 464 = 29 × 16 before tokenisation, then crop back to 457×457 after reconstruction.
index.html<div class="gallery-card"> blocks inside the .gallery-grid below.
script.pbs to match your cluster's resource requirements, then submit the job: