// Small High-Resolution Vision Transformer

VisionForecaster

Decoder-only ViT with Conv Patch Embed · Sector-GPSA · LayerScale · DropPath — tuned for ~500 samples
Input Tensor
entry point
B × 1 × 457 × 457
float32  fold-standardised GICS-reordered distance matrix D_t
reflect pad 457 → 464
🔲
Convolutional Patch Embedding
conv + norm
Conv2d(1, 64, kernel=16, stride=16) → (B, 64, 29, 29) → flatten → (B, 841, 64)
LayerNorm(64)
16,576 params total
output: (B, 841, 64)
element-wise add
📍
Positional Embedding
learned
nn.Parameter (1, 841, 64)
init: trunc_normal(std=0.02)
tokens = tokens + pos_embed  →  (B, 841, 64)
× 1 block
🔷
DecoderBlock  (drop_path=0.05)
Sector-GPSA + FFN
Attention Branch — Sector-GPSA
LayerNorm (64)
Sector-Gated Positional Self-Attention
QKV proj: 64 → 3×64 (bias=True) · heads=2 · head_dim=32
A_content = softmax(Q·Kᵀ / √32)   (B, H, N, N)
A_pos = sector-pair membership matrix   (N, N)
  group = frozenset({row_sector, col_sector}) per patch
  A_pos[i,j] = 1/|group(i)| if same group, else 0
  Each row sums to 1 — uniform within-group prior
g_h = sigmoid(λ_h) per head · shape (H,) · init λ=0 → g=0.5
output_h = g_h·(A_pos @ V) + (1−g_h)·(A_content @ V)
out proj: 64 → 64
LayerScale γ₁ (init 1e-2, shape 64)
DropPath (p=0.05)
⤷ residual add to input x
Feed-Forward Branch
LayerNorm (64)
FeedForward MLP
Linear 64 → 256 (mlp_ratio=4×)
GELU
Dropout (0.1)
Linear 256 → 64
Dropout (0.1)
LayerScale γ₂ (init 1e-2, shape 64)
DropPath (p=0.05)
⤷ residual add to post-attn x
output shape unchanged: (B, 841, 64)
Final LayerNorm
post-transformer
LayerNorm (64)
(B, 841, 64)
🖼
Pixel Reconstruction Head
single linear
Linear 64→256
single linear — direct gradient path from loss to transformer
patch_dim = 1×16×16 = 256  →  (B, 841, 256)
unpatchify
(B, 1, 464, 464)  then crop  →  (B, 1, 457, 457)
Output Tensor
prediction t+1
B × 1 × 457 × 457
predicted change ΔD̂_t, an estimate of ΔD_t = D_{t+1} − D_t  ·  MSE loss vs ground-truth ΔD_t
Convolutional Patch Embedding

Conv2d(1, 64, kernel=16, stride=16) extracts and projects each patch in one step, followed by LayerNorm(64). The norm operates in embed_dim space (128 params fixed) rather than patch space, making it cheaper than a flat linear embedding. Total: 16,576 params.
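The embedding described above can be sketched as a minimal PyTorch module (class and variable names are illustrative, not taken from the repository):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Conv2d patch extraction + projection, then LayerNorm in embed space."""
    def __init__(self, in_ch=1, embed_dim=64, patch=16):
        super().__init__()
        # 64*1*16*16 weights + 64 biases = 16,448 params
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)
        # norm over embed_dim: 2 * 64 = 128 params, independent of patch size
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                     # x: (B, 1, 464, 464)
        x = self.proj(x)                      # (B, 64, 29, 29)
        x = x.flatten(2).transpose(1, 2)      # (B, 841, 64)
        return self.norm(x)

embed = PatchEmbed()
n_params = sum(p.numel() for p in embed.parameters())  # 16,448 + 128 = 16,576
```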

Sector-GPSA Gate

g_h = sigmoid(λ_h) per head. g→1: head relies on the GICS sector prior. g→0: head is fully content-driven. Initialised at λ=0 (g=0.5) so both streams contribute equally from the first epoch.
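At λ=0 the gated combination is an even 50/50 blend of the two streams. A toy-sized sketch of the blend (random tensors and a uniform placeholder prior stand in for real Q/K/V and the sector matrix):

```python
import torch

H, N, d = 2, 16, 32                        # heads, tokens, head_dim (toy sizes)
lam = torch.zeros(H)                       # learnable gate logits, init λ = 0
g = torch.sigmoid(lam)                     # per-head gate, 0.5 at init

A_content = torch.softmax(torch.randn(1, H, N, N), dim=-1)
A_pos = torch.full((N, N), 1.0 / N)        # placeholder prior; each row sums to 1
V = torch.randn(1, H, N, d)

gate = g.view(1, H, 1, 1)                  # broadcast gate over tokens/channels
out = gate * (A_pos @ V) + (1 - gate) * (A_content @ V)
```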

Sector Positional Prior A_pos

Row-normalised sector-pair membership matrix. Each patch at grid (r,c) is assigned group frozenset({row_sector, col_sector}), so (r,c) and (c,r) share the same group, preserving distance matrix symmetry. Directly encodes the block-diagonal structure of the GICS-reordered distance matrix.
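One way to build A_pos, assuming a `row_sector` list giving the GICS sector of each patch row on the patch grid (a simplification — the repository may map patches to sectors differently):

```python
import torch

def build_A_pos(row_sector):
    """Row-normalised sector-pair membership prior.

    row_sector: length-G list, the (assumed) GICS sector of each patch row
    on the G x G patch grid. Patch (r, c) gets group
    frozenset({row_sector[r], row_sector[c]}), so (r, c) and (c, r) share
    a group, preserving distance-matrix symmetry.
    """
    G = len(row_sector)
    groups = [frozenset({row_sector[r], row_sector[c]})
              for r in range(G) for c in range(G)]   # N = G*G patches
    N = G * G
    A = torch.zeros(N, N)
    for i in range(N):
        members = [j for j in range(N) if groups[j] == groups[i]]
        A[i, members] = 1.0 / len(members)           # uniform within group
    return A

# toy example: 3 patch rows, two sectors -> 9 patches
A = build_A_pos(["tech", "tech", "energy"])
```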

LayerScale

Per-channel scale γ on each residual branch, initialised at 1e-2. Starting both branches small but non-zero keeps early updates modest, stabilising training on a small dataset while still passing gradient through the branches from the first epoch.
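LayerScale is a one-line module; a minimal sketch with the settings quoted above:

```python
import torch
import torch.nn as nn

class LayerScale(nn.Module):
    """Learnable per-channel scale applied to a residual branch output."""
    def __init__(self, dim=64, init=1e-2):
        super().__init__()
        self.gamma = nn.Parameter(init * torch.ones(dim))

    def forward(self, x):              # x: (B, N, dim)
        return self.gamma * x

ls = LayerScale()
y = ls(torch.ones(1, 3, 64))           # every channel scaled by 0.01 at init
```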

DropPath (Stoch. Depth)

Fixed rate of 0.05 (no linear depth schedule, since there is only 1 block). Light regularisation: during training, each residual branch is dropped entirely for a given sample with probability 0.05; at inference DropPath is the identity.
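A common DropPath implementation, following the stochastic-depth literature (a sketch, not necessarily the repository's exact code):

```python
import torch
import torch.nn as nn

class DropPath(nn.Module):
    """Stochastic depth: zero a whole residual branch per sample."""
    def __init__(self, p=0.05):
        super().__init__()
        self.p = p

    def forward(self, x):               # x: (B, N, dim)
        if not self.training or self.p == 0.0:
            return x
        keep = 1.0 - self.p
        # one Bernoulli draw per sample, broadcast over tokens and channels
        mask = x.new_empty(x.shape[0], 1, 1).bernoulli_(keep)
        return x * mask / keep          # rescale so the expectation is unchanged

dp = DropPath(p=0.05).eval()            # identity at inference time
```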

Padding & Crop

457 is not divisible by 16. Reflect-pad to 464 = 29 × 16 before tokenisation, then crop back to 457×457 after reconstruction.
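The pad/crop round-trip in PyTorch — the 3/4 split of the 7 padded pixels is an assumption; any split summing to 7 works as long as the crop uses matching offsets:

```python
import torch
import torch.nn.functional as F

D = torch.randn(2, 1, 457, 457)                # batch of distance matrices
# 457 = 28*16 + 9, so pad 7 pixels total per side-pair: 464 = 29*16
x = F.pad(D, (3, 4, 3, 4), mode="reflect")     # (left, right, top, bottom)
# ... tokenise, transform, reconstruct to (B, 1, 464, 464) ...
out = x[..., 3:460, 3:460]                     # crop back to 457 x 457
```

Cropping with the same offsets recovers the original interior exactly, so no information is lost to the padding.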

📊
Understanding Transformers
PowerPoint presentation  ·  .pptx
💡 If the Download button doesn't trigger a file save, use View on GitHub and click the download icon on that page instead. GitHub's raw server occasionally blocks direct binary downloads.
Required
🐍 Python 3.13.12
📦 Poetry 2.3.2
Notes
The package is not ready yet. Section under construction.
1
Clone repo
$ git clone https://github.com/PythoneerKang/VisionForecaster.git
⚠  Git is required. It is pre-installed on most Linux distributions; if not, install it via your package manager. On Windows or macOS, install Git first.
2
Install dependencies
$ cd VisionForecaster
$ poetry install
3
Activate virtual environment & run
Option A — activate shell
$ poetry shell
$ python main.py
Option B — run directly within virtual environment
$ poetry run python main.py
4
HPC execution (PBS workload manager)
optional
Modify script.pbs to match your cluster's resource requirements, then submit the job:
$ qsub script.pbs
⚠  Code is optimized for CPU-only execution on an HPC server. Additional changes are required to enable CUDA acceleration.
// Primary Reference — Attention Mechanism
ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases
Stéphane d'Ascoli, Hugo Touvron, Matthew L. Leavitt, Ari S. Morcos, Giulio Biroli, Levent Sagun  ·  ICML 2021
arXiv: 2103.10697
The Sector-GPSA gating mechanism is adapted from this work. The positional prior is replaced with a sector-membership matrix derived from GICS sector assignments, making the inductive bias domain-specific rather than relying on Euclidean grid distance.  → View on arXiv
// Inspiration — Small-Data ViT
Vision Transformer for Small-Size Datasets
Seung Hoon Lee, Seunghyun Lee, Byung Cheol Song  ·  IEEE Access, 2022
DOI: 10.1109/ACCESS.2022.3220167
Small-data motivation: LayerScale, DropPath, and the overall architecture scale follow this work, which demonstrated ViTs can train effectively on small datasets without large-scale pre-training.  → View on IEEE Xplore