Alex Here - Blog

Structural Similarity Index (SSIM)

Alex Here · 7 min read · 1463 words

Why I Built SSIM From Scratch for My VSS Project

I’ve been working on a project called Visual Secret Sharing (VSS) for my final project in uni. It’s been a deep dive, taking color images, breaking them into shares that look like static noise (or not), and then reconstructing them. Along the way, I hit a problem: how do you actually measure if your reconstruction is any good?

Most developers instinctively reach for MSE or PSNR when evaluating image quality. I did too at first. Both are simple to compute, but they measure pixel-by-pixel differences, not what the eye actually notices. When you’re working on something as visual as image reconstruction, that’s a pretty big problem.

That frustration led me to implement the Structural Similarity Index (SSIM) from scratch, no external libraries, just me and a lot of coffee. What follows is how I built it, why each piece matters, and how it’s actually helped me improve my VSS algorithms.

This post covers how SSIM compares images using local structure instead of pixel differences, the math behind the three-component formula (luminance, contrast, structure), why sliding windows catch problems that global metrics miss, and how per-channel SSIM helped me debug my VSS reconstruction.

The Gaussian Window

SSIM works by sliding a window across both images to measure local similarity instead of comparing pixel by pixel. I chose an 11x11 Gaussian window (sigma = 1.5), the same setup used in the original SSIM paper (Wang et al., 2004), and one that also makes intuitive sense when you think about human vision.

private static final int WINDOW_SIZE = 11;
private static final double SIGMA = 1.5;
private static final double[][] GAUSSIAN_KERNEL;

static {
    GAUSSIAN_KERNEL = createGaussianKernel(WINDOW_SIZE, SIGMA);
}

The kernel is a 2D Gaussian function, normalized so the weights add up to 1.0:

private static double[][] createGaussianKernel(int size, double sigma) {
    double[][] kernel = new double[size][size];
    double sum = 0.0;
    int half = size / 2;

    for (int y = 0; y < size; y++) {
        for (int x = 0; x < size; x++) {
            int dx = x - half;
            int dy = y - half;
            double val = Math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma));
            kernel[y][x] = val;
            sum += val;
        }
    }

    // Normalize
    for (int y = 0; y < size; y++) {
        for (int x = 0; x < size; x++) {
            kernel[y][x] /= sum;
        }
    }
    return kernel;
}

Why Gaussian? Your vision is sharpest right where you’re focused, and gets progressively blurrier toward the edges. That’s exactly what a Gaussian gives us: the most weight on the center pixel, with influence fading outward.

Why sigma = 1.5? It strikes a balance: tight enough to catch local details, but wide enough to ignore single-pixel noise that doesn’t matter to perception anyway. And since this kernel never changes, I compute it once in a static block and reuse it everywhere. No sense doing the same math over and over.
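A quick sanity check makes those kernel properties concrete. The sketch below repeats the kernel construction so it compiles on its own; the class name `GaussianKernelDemo` and the `sum` helper are just illustrative names, not part of my actual SSIM class.

```java
final class GaussianKernelDemo {
    // Same construction as createGaussianKernel above, repeated so this sketch is self-contained.
    static double[][] createGaussianKernel(int size, double sigma) {
        double[][] kernel = new double[size][size];
        double sum = 0.0;
        int half = size / 2;
        for (int y = 0; y < size; y++) {
            for (int x = 0; x < size; x++) {
                int dx = x - half;
                int dy = y - half;
                kernel[y][x] = Math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma));
                sum += kernel[y][x];
            }
        }
        // Normalize so the weights add up to 1.0
        for (double[] row : kernel) {
            for (int x = 0; x < size; x++) {
                row[x] /= sum;
            }
        }
        return kernel;
    }

    // Adds up every weight in the kernel (hypothetical helper for the check below).
    static double sum(double[][] kernel) {
        double total = 0.0;
        for (double[] row : kernel) {
            for (double w : row) {
                total += w;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] k = createGaussianKernel(11, 1.5);
        System.out.printf("sum of weights = %.12f%n", sum(k));
        System.out.printf("center weight  = %.6f%n", k[5][5]);
        System.out.printf("corner weight  = %.3e%n", k[0][0]);
    }
}
```

The center weight should come out as the largest entry and the corners as the smallest, with the kernel symmetric in both directions.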

The SSIM Formula: Breaking It Down

SSIM compares two image chunks using this calculation:

private double calculateWindowSSIM(double[][] ch1, double[][] ch2, int startY, int startX) {
    double mean1 = weightedMean(ch1, startY, startX);
    double mean2 = weightedMean(ch2, startY, startX);

    double var1 = weightedVariance(ch1, startY, startX, mean1);
    double var2 = weightedVariance(ch2, startY, startX, mean2);

    double cov = weightedCovariance(ch1, ch2, startY, startX, mean1, mean2);

    double numerator   = (2.0 * mean1 * mean2 + C1) * (2.0 * cov + C2);
    double denominator = (mean1 * mean1 + mean2 * mean2 + C1) * (var1 + var2 + C2);

    return denominator > 0 ? numerator / denominator : 0.0;
}

That might look intimidating, but it’s really just three intuitive comparisons multiplied together:

  1. Luminance comparison: (2μₓμᵧ + C₁) / (μₓ² + μᵧ² + C₁) — how similar are the average brightness levels?

  2. Contrast comparison: (2σₓσᵧ + C₂) / (σₓ² + σᵧ² + C₂) — do the images have similar amounts of detail or variation?

  3. Structure comparison: (σₓᵧ + C₂/2) / (σₓσᵧ + C₂/2) — do the patterns line up?
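To see how these three ratios collapse into the two-factor expression that calculateWindowSSIM computes, multiply the contrast and structure terms. Because 2σₓσᵧ + C₂ = 2(σₓσᵧ + C₂/2), the σₓσᵧ factor cancels:

```latex
\[
c(x,y)\, s(x,y)
  = \frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}
    \cdot \frac{\sigma_{xy} + C_2/2}{\sigma_x\sigma_y + C_2/2}
  = \frac{2\left(\sigma_x\sigma_y + C_2/2\right)\left(\sigma_{xy} + C_2/2\right)}
         {\left(\sigma_x^2 + \sigma_y^2 + C_2\right)\left(\sigma_x\sigma_y + C_2/2\right)}
  = \frac{2\sigma_{xy} + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}
\]
\[
\mathrm{SSIM}(x,y)
  = l(x,y)\, c(x,y)\, s(x,y)
  = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}
         {(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}
\]
```

That is why the code only needs the two means, the two variances, and the covariance, and never takes a square root to get σₓ or σᵧ.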

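The C1 and C2 constants keep the ratios stable when a denominator gets close to zero, an important practical concern for real images. C1 guards the luminance term when both means are near zero (very dark patches), and C2 guards the contrast and structure terms when both variances are near zero (flat regions like a patch of pure sky). Without them, you’d get numerical instability in exactly those areas.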
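The snippet above uses C1, C2, and three weighted-statistics helpers without showing them. Here is a minimal sketch of how they could look. It assumes the defaults from Wang et al. (2004) for 8-bit data (K1 = 0.01, K2 = 0.03, L = 255), and the class name `SsimWindowSketch`, the `DYNAMIC_RANGE` constant, and the shortcut of computing variance as the covariance of a channel with itself are choices made for this illustration, not necessarily what my real class does.

```java
final class SsimWindowSketch {
    static final int WINDOW_SIZE = 11;
    static final double SIGMA = 1.5;

    // Assumed values: the paper's defaults for 8-bit pixel data.
    static final double DYNAMIC_RANGE = 255.0;                   // L
    static final double C1 = Math.pow(0.01 * DYNAMIC_RANGE, 2);  // (K1 * L)^2, about 6.5025
    static final double C2 = Math.pow(0.03 * DYNAMIC_RANGE, 2);  // (K2 * L)^2, about 58.5225

    static final double[][] KERNEL = createGaussianKernel(WINDOW_SIZE, SIGMA);

    static double[][] createGaussianKernel(int size, double sigma) {
        double[][] kernel = new double[size][size];
        double sum = 0.0;
        int half = size / 2;
        for (int y = 0; y < size; y++) {
            for (int x = 0; x < size; x++) {
                int dx = x - half;
                int dy = y - half;
                kernel[y][x] = Math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma));
                sum += kernel[y][x];
            }
        }
        for (double[] row : kernel) {
            for (int x = 0; x < size; x++) {
                row[x] /= sum;
            }
        }
        return kernel;
    }

    // Gaussian-weighted mean of the window whose top-left corner is (startX, startY).
    static double weightedMean(double[][] ch, int startY, int startX) {
        double mean = 0.0;
        for (int wy = 0; wy < WINDOW_SIZE; wy++) {
            for (int wx = 0; wx < WINDOW_SIZE; wx++) {
                mean += KERNEL[wy][wx] * ch[startY + wy][startX + wx];
            }
        }
        return mean;
    }

    // Gaussian-weighted covariance; passing the same channel twice yields the variance.
    static double weightedCovariance(double[][] a, double[][] b, int startY, int startX,
                                     double meanA, double meanB) {
        double cov = 0.0;
        for (int wy = 0; wy < WINDOW_SIZE; wy++) {
            for (int wx = 0; wx < WINDOW_SIZE; wx++) {
                double da = a[startY + wy][startX + wx] - meanA;
                double db = b[startY + wy][startX + wx] - meanB;
                cov += KERNEL[wy][wx] * da * db;
            }
        }
        return cov;
    }

    // SSIM of a single window, using the combined two-factor formula.
    static double windowSSIM(double[][] a, double[][] b, int startY, int startX) {
        double meanA = weightedMean(a, startY, startX);
        double meanB = weightedMean(b, startY, startX);
        double varA = weightedCovariance(a, a, startY, startX, meanA, meanA);
        double varB = weightedCovariance(b, b, startY, startX, meanB, meanB);
        double cov = weightedCovariance(a, b, startY, startX, meanA, meanB);

        double numerator = (2.0 * meanA * meanB + C1) * (2.0 * cov + C2);
        double denominator = (meanA * meanA + meanB * meanB + C1) * (varA + varB + C2);
        return numerator / denominator; // denominator > 0 because C1 and C2 are positive
    }
}
```

Comparing a window with itself gives exactly 1.0, while comparing a gradient with its inverted copy gives a score well below 1 because the covariance turns negative.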

From Color to Grayscale

When applying SSIM to color images, we actually convert to grayscale first using the ITU-R BT.601 standard. You might wonder why we don’t just compare each color channel separately.

The conversion looks like this:

y1[y][x] = 0.299 * img1.getPixel(0, x, y)   // Red
         + 0.587 * img1.getPixel(1, x, y)   // Green
         + 0.114 * img1.getPixel(2, x, y);  // Blue

The weights aren’t random. They come from decades of broadcast television research and match how our eyes actually perceive brightness in daylight conditions. Green gets the highest weight (0.587) because our visual system is most sensitive to green light. Red gets medium weight (0.299), and blue gets the least (0.114).

This matters because when we’re judging image quality, we’re not really evaluating red, green, and blue independently. We’re evaluating the perceived brightness and structure. A simple average of channels would give blue too much importance relative to what we actually see.
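A small comparison shows the difference. This is only a sketch: it works on plain int values rather than my RawImage class, and `LumaDemo`, `luma`, and `simpleAverage` are names made up for the example.

```java
final class LumaDemo {
    // BT.601 luma from 8-bit R, G, B values.
    static double luma(int r, int g, int b) {
        return 0.299 * r + 0.587 * g + 0.114 * b;
    }

    // Naive alternative that treats all three channels equally.
    static double simpleAverage(int r, int g, int b) {
        return (r + g + b) / 3.0;
    }

    public static void main(String[] args) {
        int[][] colors = { {255, 0, 0}, {0, 255, 0}, {0, 0, 255} };
        String[] names = { "pure red", "pure green", "pure blue" };
        for (int i = 0; i < colors.length; i++) {
            int[] c = colors[i];
            System.out.printf("%-10s luma = %7.3f, simple average = %7.3f%n",
                    names[i], luma(c[0], c[1], c[2]), simpleAverage(c[0], c[1], c[2]));
        }
    }
}
```

With a simple average, all three primaries land on the same gray level (85), even though pure green looks far brighter than pure blue. The weighted version keeps them apart.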

Sliding Windows

Rather than computing a single global SSIM score, my implementation slides that 11x11 window across every possible position in the image. For a 512x512 image, that’s (512 - 11 + 1)² = 502 × 502 = 252,004 window comparisons that get averaged together.

private double calculateChannelSSIMDouble(double[][] ch1, double[][] ch2, int width, int height) {
    int numWindowsX = width - WINDOW_SIZE + 1;
    int numWindowsY = height - WINDOW_SIZE + 1;
    double totalSSIM = 0.0;
    int windowCount = 0;

    for (int y = 0; y < numWindowsY; y++) {
        for (int x = 0; x < numWindowsX; x++) {
            double windowSSIM = calculateWindowSSIM(ch1, ch2, y, x);
            totalSSIM += windowSSIM;
            windowCount++;
        }
    }

    return windowCount > 0 ? totalSSIM / windowCount : 0.0;
}

Why use this approach? Global metrics can conceal significant issues. Imagine an image where 90% looks perfect but 10% is completely garbled. A global MSE might still look “acceptable,” but that 10% corruption could be ruinous for your use case. With sliding windows, you catch those localized failures. These are exactly the artifacts that show up in image reconstruction, where certain regions lose fidelity depending on how the algorithm handles different color channels or threshold schemes. That is also why I ended up tracking per-channel scores as a debugging tool.
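One way to make those localized failures visible is to keep the individual window scores and look at more than their mean. The code above only accumulates a running total, so treat this as a hypothetical extension: the `SsimMapSummary` class and its methods are invented for this sketch, and the scores in `main` are made-up numbers for the 90%/10% scenario.

```java
final class SsimMapSummary {
    // Average of all window scores (what a mean-SSIM value reports).
    static double mean(double[] scores) {
        double total = 0.0;
        for (double s : scores) {
            total += s;
        }
        return total / scores.length;
    }

    // Worst single window, which the average alone cannot show.
    static double min(double[] scores) {
        double worst = Double.POSITIVE_INFINITY;
        for (double s : scores) {
            worst = Math.min(worst, s);
        }
        return worst;
    }

    // Share of windows scoring below the given threshold.
    static double fractionBelow(double[] scores, double threshold) {
        int count = 0;
        for (double s : scores) {
            if (s < threshold) {
                count++;
            }
        }
        return count / (double) scores.length;
    }

    public static void main(String[] args) {
        // Made-up scores: 90 near-perfect windows and 10 badly damaged ones.
        double[] scores = new double[100];
        java.util.Arrays.fill(scores, 0, 90, 1.0);
        java.util.Arrays.fill(scores, 90, 100, 0.1);

        System.out.printf("mean = %.2f, min = %.2f, below 0.5 = %.0f%%%n",
                mean(scores), min(scores), 100.0 * fractionBelow(scores, 0.5));
    }
}
```

In this made-up case the mean stays around 0.91, but the minimum and the share of windows below 0.5 point straight at the damaged region.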

Per-Channel SSIM for Better Debugging

While luminance SSIM provides the primary quality score, I found it incredibly useful to also track SSIM for each color channel separately. This isn’t in the original SSIM spec. It’s just something that proved handy for my final project use cases.

public double calculate(RawImage img1, RawImage img2) {
    // Primary score: SSIM on the BT.601 luminance of both images.
    double luminanceSSIM = calculateLuminanceSSIM(img1, img2, width, height);

    // Diagnostic scores: SSIM per color channel (0 = red, 1 = green, 2 = blue).
    for (int c = 0; c < 3; c++) {
        int[][] ch1 = img1.getChannelData(c);
        int[][] ch2 = img2.getChannelData(c);
        channelSSIM[c] = calculateChannelSSIM(ch1, ch2, width, height);
    }

    return luminanceSSIM;
}

public double[] getChannelSSIM() {
    return channelSSIM.clone();
}

Why clone the array on return? It’s simple defensive programming. It prevents calling code from accidentally messing with my internal state.

This per-channel breakdown has saved me countless debugging hours. When a reconstruction looks off, I can immediately see which color channel is causing trouble. For instance, if blue channel SSIM is 0.72 while red and green are both above 0.94, I know the blue channel is the weak link, and that tells me where to start looking in my algorithm implementation.
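Picking out the weak channel is easy to automate. The helper below is a small sketch that takes the array returned by getChannelSSIM(); the `ChannelDiagnostics` class and `weakestChannel` method are hypothetical names, and the scores in `main` are the illustrative numbers from the example above, not real measurements.

```java
final class ChannelDiagnostics {
    // Channel order matches the getPixel/getChannelData indices: 0 = red, 1 = green, 2 = blue.
    static final String[] CHANNEL_NAMES = { "red", "green", "blue" };

    // Returns the index of the channel with the lowest SSIM score.
    static int weakestChannel(double[] channelSSIM) {
        int weakest = 0;
        for (int c = 1; c < channelSSIM.length; c++) {
            if (channelSSIM[c] < channelSSIM[weakest]) {
                weakest = c;
            }
        }
        return weakest;
    }

    public static void main(String[] args) {
        double[] channelSSIM = { 0.95, 0.96, 0.72 }; // illustrative values only
        int weakest = weakestChannel(channelSSIM);
        System.out.printf("weakest channel: %s (SSIM %.2f)%n",
                CHANNEL_NAMES[weakest], channelSSIM[weakest]);
    }
}
```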

What I Learned Building This

After implementing SSIMMetric in dependency-free Java, I’ve got a deeper appreciation for why this metric works so well. The main design choices (the Gaussian window, the luminance weights) trace back to actual properties of human vision, while the constants are there to keep the math numerically stable.

For my VSS implementation specifically, SSIM has been transformative. It catches structural similarities and differences that MSE completely misses. When I’m tweaking algorithms, the per-channel breakdown shows me precisely which aspects need work.

If you’re working on anything that involves image quality (compression, reconstruction, enhancement, or even just filtering), I strongly encourage you to look beyond MSE and PSNR. Implementing SSIM isn’t as hard as you might think, and the payoff of getting metrics that actually match what you see is enormous.

Key Takeaways

SSIM measures structural similarity, not just pixel differences, which makes it much closer to how humans perceive image quality. The 11x11 Gaussian window (σ=1.5) mimics human foveal vision. Sliding windows reveal localized failures that global metrics like MSE hide. And per-channel breakdown is invaluable for debugging algorithm issues.


References

  1. Wang, Z., Bovik, A. C., Sheikh, H. R., & Simoncelli, E. P. (2004). “Image Quality Assessment: From Error Visibility to Structural Similarity.” IEEE Transactions on Image Processing, 13(4), 600-612.
  2. ITU-R Recommendation BT.601-7: “Studio encoding parameters of digital television for standard 4:3 and wide-screen 16:9 aspect ratios.”