// Small High-Resolution Vision Transformer

VisionForecaster

Decoder-only ViT with Conv Patch Embed · Sector-GPSA · LayerScale · DropPath — tuned for ~500 samples
Input Tensor
entry point
B × 1 × 457 × 457
float32  fold-standardised GICS-reordered distance matrix D_t
reflect pad 457 → 464
🔲
Convolutional Patch Embedding
conv + norm
Conv2d(1, 64, kernel=16, stride=16) → (B, 64, 29, 29) → flatten → (B, 841, 64)
LayerNorm(64)
16,576 params total
output: (B, 841, 64)
element-wise add
📍
Positional Embedding
learned
nn.Parameter (1, 841, 64)
init: trunc_normal(std=0.02)
tokens = tokens + pos_embed  →  (B, 841, 64)
× 1 block
🔷
DecoderBlock  (drop_path=0.05)
Sector-GPSA + FFN
Attention Branch — Sector-GPSA
LayerNorm (64)
Sector-Gated Positional Self-Attention
QKV proj: 64 → 3×64 (bias=True) · heads=2 · head_dim=32
A_content = softmax(Q·Kᵀ / √32)   (B, H, N, N)
A_pos = sector-pair membership matrix   (N, N)
  group = frozenset({row_sector, col_sector}) per patch
  A_pos[i,j] = 1/|group(i)| if same group, else 0
  Each row sums to 1 — uniform within-group prior
g_h = sigmoid(λ_h) per head · shape (H,) · init λ=0 → g=0.5
output_h = g_h·(A_pos @ V) + (1−g_h)·(A_content @ V)
out proj: 64 → 64
LayerScale γ₁ (init 1e-2, shape 64)
DropPath (p=0.05)
⤷ residual add to input x
Feed-Forward Branch
LayerNorm (64)
FeedForward MLP
Linear 64 → 256 (mlp_ratio=4×)
GELU
Dropout (0.1)
Linear 256 → 64
Dropout (0.1)
LayerScale γ₂ (init 1e-2, shape 64)
DropPath (p=0.05)
⤷ residual add to post-attn x
output shape unchanged: (B, 841, 64)
Final LayerNorm
post-transformer
LayerNorm (64)
(B, 841, 64)
🖼
Pixel Reconstruction Head
single linear
Linear 64→256
single linear — direct gradient path from loss to transformer
patch_dim = 1×16×16 = 256  →  (B, 841, 256)
unpatchify
(B, 1, 464, 464)  then crop  →  (B, 1, 457, 457)
Output Tensor
prediction t+1
B × 1 × 457 × 457
predicted change ΔD̂_t, an estimate of ΔD_t = D_{t+1} − D_t  ·  MSE loss vs ground-truth ΔD_t
Convolutional Patch Embedding

Conv2d(1, 64, kernel=16, stride=16) extracts and projects each patch in one step, followed by LayerNorm(64). The norm operates in embed_dim space (128 params fixed) rather than patch space, making it cheaper than a flat linear embedding. Total: 16,576 params.
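The embedding described above can be sketched as a minimal PyTorch module (class and variable names are illustrative, not taken from the repository):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Conv2d patch extraction + projection, then LayerNorm in embed space."""
    def __init__(self, in_ch=1, embed_dim=64, patch=16):
        super().__init__()
        # 64*1*16*16 weights + 64 biases = 16,448 params
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)
        # norm over embed_dim: 2 * 64 = 128 params, independent of patch size
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                     # x: (B, 1, 464, 464)
        x = self.proj(x)                      # (B, 64, 29, 29)
        x = x.flatten(2).transpose(1, 2)      # (B, 841, 64)
        return self.norm(x)

embed = PatchEmbed()
n_params = sum(p.numel() for p in embed.parameters())  # 16,448 + 128 = 16,576
```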

Sector-GPSA Gate

g_h = sigmoid(λ_h) per head. g→1: head relies on the GICS sector prior. g→0: head is fully content-driven. Initialised at λ=0 (g=0.5) so both streams contribute equally from the first epoch.
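At λ=0 the gated combination is an even 50/50 blend of the two streams. A toy-sized sketch of the blend (random tensors and a uniform placeholder prior stand in for real Q/K/V and the sector matrix):

```python
import torch

H, N, d = 2, 16, 32                        # heads, tokens, head_dim (toy sizes)
lam = torch.zeros(H)                       # learnable gate logits, init λ = 0
g = torch.sigmoid(lam)                     # per-head gate, 0.5 at init

A_content = torch.softmax(torch.randn(1, H, N, N), dim=-1)
A_pos = torch.full((N, N), 1.0 / N)        # placeholder prior; each row sums to 1
V = torch.randn(1, H, N, d)

gate = g.view(1, H, 1, 1)                  # broadcast gate over tokens/channels
out = gate * (A_pos @ V) + (1 - gate) * (A_content @ V)
```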

Sector Positional Prior A_pos

Row-normalised sector-pair membership matrix. Each patch at grid (r,c) is assigned group frozenset({row_sector, col_sector}), so (r,c) and (c,r) share the same group, preserving distance matrix symmetry. Directly encodes the block-diagonal structure of the GICS-reordered distance matrix.
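One way to build A_pos, assuming a `row_sector` list giving the GICS sector of each patch row on the patch grid (a simplification — the repository may map patches to sectors differently):

```python
import torch

def build_A_pos(row_sector):
    """Row-normalised sector-pair membership prior.

    row_sector: length-G list, the (assumed) GICS sector of each patch row
    on the G x G patch grid. Patch (r, c) gets group
    frozenset({row_sector[r], row_sector[c]}), so (r, c) and (c, r) share
    a group, preserving distance-matrix symmetry.
    """
    G = len(row_sector)
    groups = [frozenset({row_sector[r], row_sector[c]})
              for r in range(G) for c in range(G)]   # N = G*G patches
    N = G * G
    A = torch.zeros(N, N)
    for i in range(N):
        members = [j for j in range(N) if groups[j] == groups[i]]
        A[i, members] = 1.0 / len(members)           # uniform within group
    return A

# toy example: 3 patch rows, two sectors -> 9 patches
A = build_A_pos(["tech", "tech", "energy"])
```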

LayerScale

Per-channel scale γ on each residual branch, initialised at 1e-2. Starting both branches small but non-zero keeps early updates modest, stabilising training on a small dataset while still passing gradient through the branches from the first epoch.
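LayerScale is a one-line module; a minimal sketch with the settings quoted above:

```python
import torch
import torch.nn as nn

class LayerScale(nn.Module):
    """Learnable per-channel scale applied to a residual branch output."""
    def __init__(self, dim=64, init=1e-2):
        super().__init__()
        self.gamma = nn.Parameter(init * torch.ones(dim))

    def forward(self, x):              # x: (B, N, dim)
        return self.gamma * x

ls = LayerScale()
y = ls(torch.ones(1, 3, 64))           # every channel scaled by 0.01 at init
```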

DropPath (Stoch. Depth)

Fixed rate of 0.05 (no linear depth schedule, since there is only 1 block). Light regularisation: during training, each residual branch is dropped entirely for a given sample with probability 0.05; at inference DropPath is the identity.
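A common DropPath implementation, following the stochastic-depth literature (a sketch, not necessarily the repository's exact code):

```python
import torch
import torch.nn as nn

class DropPath(nn.Module):
    """Stochastic depth: zero a whole residual branch per sample."""
    def __init__(self, p=0.05):
        super().__init__()
        self.p = p

    def forward(self, x):               # x: (B, N, dim)
        if not self.training or self.p == 0.0:
            return x
        keep = 1.0 - self.p
        # one Bernoulli draw per sample, broadcast over tokens and channels
        mask = x.new_empty(x.shape[0], 1, 1).bernoulli_(keep)
        return x * mask / keep          # rescale so the expectation is unchanged

dp = DropPath(p=0.05).eval()            # identity at inference time
```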

Padding & Crop

457 is not divisible by 16. Reflect-pad to 464 = 29 × 16 before tokenisation, then crop back to 457×457 after reconstruction.
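The pad/crop round-trip in PyTorch — the 3/4 split of the 7 padded pixels is an assumption; any split summing to 7 works as long as the crop uses matching offsets:

```python
import torch
import torch.nn.functional as F

D = torch.randn(2, 1, 457, 457)                # batch of distance matrices
# 457 = 28*16 + 9, so pad 7 pixels total per side-pair: 464 = 29*16
x = F.pad(D, (3, 4, 3, 4), mode="reflect")     # (left, right, top, bottom)
# ... tokenise, transform, reconstruct to (B, 1, 464, 464) ...
out = x[..., 3:460, 3:460]                     # crop back to 457 x 457
```

Cropping with the same offsets recovers the original interior exactly, so no information is lost to the padding.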

📊
Understanding Transformers
PowerPoint presentation  ·  .pptx
💡 If the Download button doesn't trigger a file save, use View on GitHub and click the download icon on that page instead. GitHub's raw server occasionally blocks direct binary downloads.
Required
🐍 Python 3.13.12
📦 Poetry 2.3.2
Notes
The package is not ready yet. Section under construction.
1
Clone repo
$ git clone https://github.com/PythoneerKang/VisionForecaster.git
⚠  Git is required. It is pre-installed on most Linux distributions; if not, install it via your package manager. On Windows or macOS, install Git first.
2
Install dependencies
$ cd VisionForecaster
$ poetry install
3
Activate virtual environment & run
Option A — activate shell
$ poetry shell
$ python main.py
Option B — run directly within virtual environment
$ poetry run python main.py
4
HPC execution (PBS workload manager)
optional
Modify script.pbs to match your cluster's resource requirements, then submit the job:
$ qsub script.pbs
⚠  Code is optimized for CPU-only execution on an HPC server. Additional changes are required to enable CUDA acceleration.
// Primary Reference — Attention Mechanism
ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases
Stéphane d'Ascoli, Hugo Touvron, Matthew L. Leavitt, Ari S. Morcos, Giulio Biroli, Levent Sagun  ·  ICML 2021
arXiv: 2103.10697
The Sector-GPSA gating mechanism is adapted from this work. The positional prior is replaced with a sector-membership matrix derived from GICS sector assignments, making the inductive bias domain-specific rather than relying on Euclidean grid distance.  → View on arXiv
// Inspiration — Small-Data ViT
Vision Transformer for Small-Size Datasets
Seung Hoon Lee, Seunghyun Lee, Byung Cheol Song  ·  IEEE Access, 2022
DOI: 10.1109/ACCESS.2022.3220167
Small-data motivation: LayerScale, DropPath, and the overall architecture scale follow this work, which demonstrated ViTs can train effectively on small datasets without large-scale pre-training.  → View on IEEE Xplore