<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://skandanyal.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://skandanyal.github.io/" rel="alternate" type="text/html" /><updated>2026-07-01T19:30:59+00:00</updated><id>https://skandanyal.github.io/feed.xml</id><title type="html">Skandan C Yalagach</title><subtitle>From Math to Machines.</subtitle><author><name>Skandan C Yalagach</name><email>skandanyalagach@gmail.com</email><uri>https://skandanyal.github.io</uri></author><entry><title type="html">Roofline Models: From Speculation to Modelling</title><link href="https://skandanyal.github.io/from_math_to_machines/gb4/" rel="alternate" type="text/html" title="Roofline Models: From Speculation to Modelling" /><published>2026-03-26T00:00:00+00:00</published><updated>2026-03-26T00:00:00+00:00</updated><id>https://skandanyal.github.io/from_math_to_machines/gb4</id><content type="html" xml:base="https://skandanyal.github.io/from_math_to_machines/gb4/"><![CDATA[<h2 id="understanding-the-machine">Understanding the Machine</h2>

<p>A modern Central Processing Unit (CPU) consists of multiple independent execution units called <strong>cores</strong>. Each core is an independent hardware unit containing its own execution pipeline, registers, and control logic. A multicore CPU with $N$ physical cores can run up to $N$ independent execution streams in parallel.</p>

<p>Unlike physical cores, a <strong>thread</strong> is a software-defined execution context comprising a program counter, register states, and a call stack. Operating systems schedule threads onto physical cores. A core executes instructions from a single thread context at any given moment, unless it supports <strong>Simultaneous Multithreading (SMT)</strong>.</p>

<p>SMT allows a single physical core to interleave instructions from multiple thread contexts, optimizing pipeline utilization without doubling raw execution hardware.</p>

<h2 id="parallel-execution">Parallel Execution</h2>

<p>Parallelism arises from simultaneously executing workload(s) on multiple cores.</p>

<p>For $C$ physical cores and $T$ threads:</p>
<ul>
  <li>If $T \le C$, every thread can run on its own core, so all threads execute in parallel.</li>
  <li>If $T &gt; C$, the operating system must time-slice execution.</li>
</ul>

<p>In C++, multi-threading can be introduced via OpenMP directives or standard <code class="language-plaintext highlighter-rouge">std::jthread</code>/<code class="language-plaintext highlighter-rouge">std::thread</code>. While standard library threads provide low-level control, OpenMP simplifies thread pooling and work sharing with minimal developer overhead.</p>
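<p>As a rough illustration of the standard-library route, the sketch below splits a sum across a fixed number of <code class="language-plaintext highlighter-rouge">std::thread</code> workers. The function name <code class="language-plaintext highlighter-rouge">parallel_sum</code> and the chunking scheme are assumptions made for this example, not code from Glacier.</p>

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: sum `data` using `num_threads` workers,
// each handling one contiguous chunk of the input.
double parallel_sum(const std::vector<double>& data, unsigned num_threads) {
    num_threads = std::max(1u, num_threads);
    std::vector<double> partial(num_threads, 0.0);  // one slot per thread, so no locking
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        workers.emplace_back([&data, &partial, t, begin, end] {
            partial[t] = std::accumulate(data.begin() + begin, data.begin() + end, 0.0);
        });
    }
    for (auto& w : workers) w.join();  // wait for every chunk before combining
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

<p>The manual chunking, thread creation, and joining here is roughly the bookkeeping that an OpenMP work-sharing directive takes care of on the developer's behalf.</p>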

<p>SMT allows multiple threads to share a single core, but this is a latency-hiding mechanism rather than a source of linear speedup.</p>

<h2 id="memory-hierarchy">Memory Hierarchy</h2>

<p>The processor does not operate directly on data from main memory in most cases. Instead, it 
relies on a hierarchy:</p>

<ul>
  <li><strong>Registers</strong> (inside the core, fastest)</li>
  <li><strong>L1 / L2 / L3 caches</strong> (on-chip, small but fast)</li>
  <li><strong>Main memory (RAM)</strong></li>
  <li><strong>Secondary storage (SSD/HDD)</strong></li>
</ul>

<p>Data movement becomes progressively slower and more expensive as we go down this hierarchy.</p>

<p>For performance analysis, the critical boundary is between:</p>
<ul>
  <li>on-chip computation</li>
  <li>off-chip memory (RAM)</li>
</ul>

<h2 id="compute-vs-data-movement">Compute vs Data Movement</h2>

<p>Any program performs two fundamental actions:</p>
<ol>
  <li><strong>Compute</strong> - arithmetic operations executed by the core</li>
  <li><strong>Data movement</strong> - transferring data between memory and the core</li>
</ol>

<p>Performance is limited by whichever of these becomes the bottleneck.</p>

<p>To quantify this, we define:</p>

<ul>
  <li><strong>Peak compute throughput</strong>: maximum operations per second a CPU can perform</li>
  <li><strong>Memory bandwidth</strong>: maximum rate at which data can be transferred from memory</li>
</ul>

<h2 id="arithmetic-intensity">Arithmetic Intensity</h2>

<p>Arithmetic intensity (AI) is the ratio of floating-point operations (FLOPs) performed to bytes transferred between memory and the core for a given kernel. It quantifies how much computation is done per byte moved; a higher AI implies more compute work per memory access.</p>

<p>This ratio categorizes kernels into two regimes:</p>
<ul>
  <li>Low intensity → memory-bound</li>
  <li>High intensity → compute-bound</li>
</ul>

<blockquote>
  <p>For example, the STREAM Triad benchmark (<code class="language-plaintext highlighter-rouge">A[i] = B[i] + scalar * C[i]</code>) measures sustainable memory bandwidth. Each iteration performs 2 FLOPs (one multiply, one add) but transfers 3 vectors’ worth of data (two reads, one write), totaling 24 bytes in double-precision. This yields an arithmetic intensity of approximately 0.083 FLOPs/byte, classifying it as a heavily memory-bound kernel.</p>
</blockquote>

<hr />

<h2 id="toward-the-roofline-model">Toward the Roofline Model</h2>

<p>The roofline model was first introduced by a group of scientists at the University of California, Berkeley
in 2008, in a paper titled:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Roofline: An Insightful Visual Performance Model for Floating-Point Programs 
and Multicore Architectures

By Samuel Williams, Andrew Waterman and David Patterson
</code></pre></div></div>

<p>The Roofline model combines peak compute throughput, memory bandwidth, and arithmetic intensity (called operational intensity in the original paper) to determine the maximum achievable performance of a program on a given machine. It answers two key questions:</p>
<ul>
  <li>Is the program limited by compute or memory?</li>
  <li>How far is the implementation from hardware limits?</li>
</ul>

<h2 id="building-a-roofline-model">Building a Roofline Model</h2>

<p>Constructing a Roofline model requires precise parameters. The accuracy of the model depends on the rigor of these measurements.</p>

<ul>
  <li><strong>Peak Sustained Memory Bandwidth:</strong> Measured using benchmarking suites like STREAM (specifically the Triad kernel). Refer to <a href="https://www.github.com/skandanyal/Glacier.HPC/STREAM">this implementation</a> for details.</li>
</ul>

\[\text{Bandwidth} = \frac{\text{Bytes transferred}}{\text{Execution time}} \text{ Bytes/sec}\]
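<p>A minimal sketch of how this formula turns into a measurement is shown below. It times a single Triad pass with <code class="language-plaintext highlighter-rouge">std::chrono</code>; the function name is hypothetical, and this is not the linked Glacier.HPC implementation. The real STREAM benchmark repeats the kernel several times and reports the best run, which a serious measurement should also do.</p>

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: run one Triad pass over n doubles (n > 0) and
// return the observed bandwidth in bytes per second.
double measure_triad_bandwidth(std::size_t n, double scalar = 3.0) {
    std::vector<double> a(n, 0.0), b(n, 1.0), c(n, 2.0);

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = b[i] + scalar * c[i];
    }
    const auto stop = std::chrono::steady_clock::now();

    // Read one element back so the compiler cannot discard the loop as dead code.
    volatile double sink = a[n / 2];
    (void)sink;

    const double seconds = std::chrono::duration<double>(stop - start).count();
    const double bytes = 3.0 * sizeof(double) * static_cast<double>(n);  // 2 reads + 1 write
    return bytes / seconds;
}
```

<p>The arrays should be much larger than the last-level cache; otherwise the number reflects cache bandwidth rather than RAM bandwidth.</p>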

<ul>
  <li><strong>Peak Compute Throughput:</strong></li>
</ul>

\[\text{Peak compute (FLOPs/sec)} = \text{Cores} \times \text{Frequency} \times \text{FLOPs per cycle per core}\]

<ul>
  <li><strong>Arithmetic Intensity:</strong></li>
</ul>

\[\text{Arithmetic Intensity (FLOPs/Byte)} = \frac{\text{FLOPs}}{\text{Bytes transferred}}\]

<ul>
  <li><strong>Arithmetic Intensity Ridge Point:</strong></li>
</ul>

\[\text{Ridge Point (FLOPs/Byte)} = \frac{\text{Peak compute throughput}}{\text{Memory bandwidth}}\]

<ul>
  <li><strong>The Roofline Bound:</strong></li>
</ul>

\[\text{Attainable Performance (FLOPs/sec)} = \min\left(\text{Peak Compute Throughput},\; \text{Memory Bandwidth} \times \text{Arithmetic Intensity}\right)\]

<h2 id="visualizing-a-roofline-model">Visualizing a Roofline Model</h2>

<p>Here is the roofline model characterising the AMD Ryzen 5 6600H processor.</p>

<p><img src="/assets/blog4/roofline.png" alt="Roofline model" /></p>

<ul>
  <li><strong>X-axis:</strong> Arithmetic Intensity (log scale).</li>
  <li><strong>Y-axis:</strong> Performance in FLOPs/sec (log scale).</li>
</ul>

<p>Workloads fall into one of two operational regimes:</p>
<ul>
  <li><strong>Memory-Bound (blue):</strong> Performance is limited by and scales with memory bandwidth.</li>
  <li><strong>Compute-Bound (orange):</strong> Performance is capped by the peak compute capacity of the execution units.</li>
</ul>

<p>The upper boundary (plotted in green) represents the peak achievable performance as a function of Arithmetic Intensity (AI). The diagonal line represents the memory-bandwidth limit, and the horizontal ceiling represents the compute limit.</p>

<h2 id="using-the-roofline-model">Using the Roofline Model</h2>

<p>The Roofline model serves as a diagnostic tool, mapping workloads on a coordinate system to represent their execution behavior. The optimization goal is to increase arithmetic intensity, moving the workload to the right to saturate execution units and extract performance from the hardware.</p>

<ul>
  <li><strong>For memory-bound workloads:</strong> Increasing AI moves performance diagonally up the slope until it hits the compute ceiling.</li>
  <li><strong>For compute-bound workloads:</strong> Performance is improved by optimizing execution efficiency (e.g., using SIMD vector instructions or reducing instruction overhead).</li>
</ul>

<h2 id="glacierhpc">Glacier.HPC</h2>

<p>Glacier.HPC is a project for profiling, benchmarking, and optimizing numerical kernels on consumer-grade hardware. Derived from the <a href="https://www.github.com/skandanyal/Glacier.ML">Glacier.ML</a> codebase, it focuses on:</p>
<ul>
  <li>Measuring arithmetic intensity and raw performance.</li>
  <li>Constructing empirical Roofline models.</li>
  <li>Evaluating compiler and loop optimization strategies.</li>
</ul>

<p>These insights are then used to optimize Glacier.ML.</p>

<p>To reference this post with full LaTeX equation formatting, you can view the raw markdown source at <a href="https://github.com/skandanyal/skandanyal.github.io/blob/main/_posts/2026-03-26-gb4.md">GitHub</a>.</p>

<p>This post is based on experimental results from Glacier.HPC. A detailed analysis using the SAXPY kernel will be covered in the next post.</p>]]></content><author><name>Skandan C Yalagach</name><email>skandanyalagach@gmail.com</email><uri>https://skandanyal.github.io</uri></author><category term="Roofline models" /><category term="Device characterization" /><summary type="html"><![CDATA[Performance modelling isn't just guesswork. It is a science backed by iterative experimentation and evidence.]]></summary></entry><entry><title type="html">Building SVMs from Scratch in C++: Glacier’s PEGASOS Implementation</title><link href="https://skandanyal.github.io/from_math_to_machines/gb3/" rel="alternate" type="text/html" title="Building SVMs from Scratch in C++: Glacier’s PEGASOS Implementation" /><published>2026-01-10T00:00:00+00:00</published><updated>2026-01-10T00:00:00+00:00</updated><id>https://skandanyal.github.io/from_math_to_machines/gb3</id><content type="html" xml:base="https://skandanyal.github.io/from_math_to_machines/gb3/"><![CDATA[<p>Glacier.ML now includes working implementations of:</p>

<ul>
  <li><strong>SVM Classifier</strong></li>
  <li><strong>SVM Regressor</strong></li>
</ul>

<p>Both are implemented in C++ using:</p>

<ul>
  <li><strong>Eigen</strong> for linear algebra</li>
  <li><strong>OpenMP</strong> for multithreading</li>
  <li><strong>OpenBLAS</strong> for optimized BLAS routines</li>
</ul>

<p>In local benchmarks, the classifier variant demonstrates <strong>4–10× faster training</strong> compared to scikit-learn under comparable conditions. This post documents the algorithmic design, performance characteristics, and implementation philosophy behind these models.</p>

<h2 id="algorithmic-clarification-not-classical-kernel-svm">Algorithmic Clarification: Not Classical Kernel SVM</h2>

<p>Glacier’s SVM implementations are not dual-form, kernelized SVMs. Instead, they are based on:</p>

<p><strong>PEGASOS</strong><br />
(Primal Estimated sub-Gradient Solver for SVM)</p>

<p>PEGASOS directly optimizes the primal objective using stochastic subgradient descent.</p>

<h3 id="classification-objective">Classification Objective</h3>

\[\min_{w} \frac{\lambda}{2} \|w\|^2
+ \frac{1}{m} \sum_{i=1}^{m} \max\left(0,\; 1 - y_i\, w^{T} x_i\right)\]

<h3 id="regression-objective-ε-insensitive-loss">Regression Objective (ε-insensitive loss)</h3>

\[\min_{w} \frac{\lambda}{2} \|w\|^2
+ \frac{1}{m} \sum_{i=1}^{m} \max\left(0,\; |y_i - w^{T} x_i| - \epsilon\right)\]

<p>Consequently, Glacier’s SVM variants are closer in behavior to linear models trained with stochastic gradient descent (<code class="language-plaintext highlighter-rouge">SGDClassifier</code> / <code class="language-plaintext highlighter-rouge">SGDRegressor</code>) than to kernelized Support Vector Classifiers (<code class="language-plaintext highlighter-rouge">SVC</code>), as no feature mapping or kernel transformations are performed. The models are strictly linear.</p>
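<p>To make the classification objective concrete, here is a minimal single-sample PEGASOS sketch in plain C++. It is not Glacier’s implementation: it uses <code class="language-plaintext highlighter-rouge">std::vector</code> instead of Eigen, omits the bias term and the optional projection step, and every name in it is chosen for illustration only. The step size $1/(\lambda t)$ follows the standard PEGASOS schedule.</p>

```cpp
#include <cstddef>
#include <random>
#include <vector>

using Vec = std::vector<double>;

double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t j = 0; j < a.size(); ++j) s += a[j] * b[j];
    return s;
}

// Hypothetical trainer for the hinge-loss objective.
// X holds m feature vectors (m > 0), y holds labels in {-1, +1}.
Vec pegasos_train(const std::vector<Vec>& X, const std::vector<int>& y,
                  double lambda, std::size_t iterations, unsigned seed = 42) {
    Vec w(X.front().size(), 0.0);
    std::mt19937 rng(seed);

    for (std::size_t t = 1; t <= iterations; ++t) {
        const std::size_t i = rng() % X.size();                       // pick one sample
        const double eta = 1.0 / (lambda * static_cast<double>(t));  // step size 1/(lambda*t)
        const double margin = y[i] * dot(w, X[i]);                    // computed before w changes

        for (double& wj : w) wj *= (1.0 - eta * lambda);  // shrink: subgradient of the L2 term
        if (margin < 1.0) {                                // hinge loss is active for this sample
            for (std::size_t j = 0; j < w.size(); ++j) w[j] += eta * y[i] * X[i][j];
        }
    }
    return w;
}

// Predict +1 or -1 from the sign of the linear score.
int predict(const Vec& w, const Vec& x) { return dot(w, x) >= 0.0 ? 1 : -1; }
```

<p>Each iteration touches a single sample and performs only a dot product and a scaled vector update, which is why the method maps naturally onto BLAS-style primitives.</p>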

<h2 id="implementation-stack">Implementation Stack</h2>

<h3 id="1-eigen--openblas">1. Eigen + OpenBLAS</h3>

<ul>
  <li>Eigen provides matrix abstractions and expression templates.</li>
  <li>OpenBLAS accelerates low-level BLAS operations.</li>
</ul>

<p>Key goals:</p>

<ul>
  <li>Cache-friendly layouts</li>
  <li>Minimal heap allocations</li>
  <li>Efficient vectorized dot products</li>
  <li>Reduced abstraction overhead</li>
</ul>

<p>The codebase avoids redundant temporary objects and prioritizes cache-local updates.</p>

<h3 id="2-openmp-threading-strategy">2. OpenMP Threading Strategy</h3>

<p>Glacier defaults to using half of the available hardware threads. Limiting thread saturation mitigates thermal throttling, stabilizes sustained performance, and preserves overall system responsiveness. Because scikit-learn’s thread usage varies depending on the underlying BLAS backend, differences in thread policy can significantly impact comparative training times.</p>
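<p>One way such a default could be computed is sketched below. The helper name is hypothetical and this is not necessarily how Glacier derives its thread count; it only illustrates the policy of halving the reported hardware threads while guarding against a zero result.</p>

```cpp
#include <algorithm>
#include <thread>

// Hypothetical policy helper: half of the reported hardware threads,
// but never fewer than one. hardware_concurrency() may return 0 when
// the value cannot be determined, hence the lower bound.
unsigned half_hardware_threads(unsigned hardware_threads = std::thread::hardware_concurrency()) {
    return std::max(1u, hardware_threads / 2);
}
```

<p>In an OpenMP build, the result could then be handed to <code class="language-plaintext highlighter-rouge">omp_set_num_threads</code> before the first parallel region.</p>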

<h2 id="performance-observations">Performance Observations</h2>

<h3 id="speed">Speed</h3>

<p>The classifier model showed <strong>4–10× faster training</strong> compared to scikit-learn in local tests.</p>

<p><img src="/assets/blog3/graph.jpeg" alt="Benchmark result" /></p>

<p>Crucially, these comparisons are not strictly apples-to-apples due to differences in convergence criteria, memory layouts, and learning rate scheduling.</p>

<h3 id="underfitting-on-give-me-some-credit">Underfitting on “Give Me Some Credit”</h3>

<p>Both implementations underfit on the “Give Me Some Credit” dataset.</p>

<p><img src="/assets/blog3/sklearn_matrix.jpeg" alt="sklearn confusion matrix" />  <br />
<img src="/assets/blog3/glacier_matrix.jpeg" alt="Glacier.ML confusion matrix" /></p>

<p>This is evidenced by poor recall on the minority class and by near-identical confusion matrices from the two implementations. It occurs because the data requires a non-linear decision surface, which a strictly linear SVM cannot resolve without kernel expansion or feature engineering. This behavior is expected.</p>

<h2 id="architectural-decisions">Architectural Decisions</h2>

<p>Core design choices:</p>

<ul>
  <li>Use primal optimization (PEGASOS)</li>
  <li>Avoid kernel complexity at this stage</li>
  <li>Emphasize hardware-level control</li>
  <li>Optimize CPU usage consciously</li>
  <li>Maintain explicit control over memory and threading</li>
</ul>

<p>The codebase is not a direct transcription of any single reference; it is the result of iterative derivation, benchmarking, and refinement.</p>

<h2 id="learning-with-ai-vs-copying-from-ai">Learning with AI vs. Copying from AI</h2>

<p>Mathematical derivations and implementation details were developed with the assistance of AI tools. The distinction between learning and copying lies in the ability to re-derive the objective function, implement the solver from scratch, and understand convergence behavior, regularization, and hardware trade-offs. Re-implementing a library from first principles guarantees conceptual ownership and clarity.</p>

<p>AI tools functioned as a refinement engine, error detector, and a resource to accelerate textbook research, while the core architectural decisions remained strictly manual.</p>

<h2 id="systems-level-insights">Systems-Level Insights</h2>

<p>Implementing PEGASOS from scratch highlights systems-level challenges like memory bandwidth limits, thread synchronization, numerical stability in subgradient descent, and learning rate scheduling. The exercise bridges algorithmic theory with hardware-aware optimization.</p>

<h2 id="future-scope">Future Scope</h2>

<p>The next structural milestone is CUDA acceleration. This will enable:</p>

<ul>
  <li>Large-scale mini-batch updates</li>
  <li>Faster convergence on high-dimensional datasets</li>
  <li>Experimental work in GPU-accelerated optimization</li>
</ul>

<p>This work is planned for subsequent iterations.</p>

<h2 id="conclusion">Conclusion</h2>

<p>Glacier’s SVM implementations leverage the PEGASOS primal optimization method in C++, integrating Eigen, OpenMP, and OpenBLAS to achieve high training speeds. They underfit predictably on non-linear datasets, reflecting expected theoretical limits.</p>

<p>The primary value of Glacier lies in establishing vertical control—bridging mathematical theory, memory layout, thread scheduling, and machine execution. This foundational understanding compounds.</p>]]></content><author><name>Skandan C Yalagach</name><email>skandanyalagach@gmail.com</email><uri>https://skandanyal.github.io</uri></author><category term="SVM" /><category term="PEGASOS" /><summary type="html"><![CDATA[Implemented PEGASOS SVM classifier and regression models in Glacier.ML - noted 4-10x performance speed-ups in training + inference times against Scikit-learn SGD.]]></summary></entry><entry><title type="html">Optimized k Nearest Neighbours: From Naive to Hardware-Aware</title><link href="https://skandanyal.github.io/from_math_to_machines/gb2/" rel="alternate" type="text/html" title="Optimized k Nearest Neighbours: From Naive to Hardware-Aware" /><published>2026-01-03T00:00:00+00:00</published><updated>2026-01-03T00:00:00+00:00</updated><id>https://skandanyal.github.io/from_math_to_machines/gb2</id><content type="html" xml:base="https://skandanyal.github.io/from_math_to_machines/gb2/"><![CDATA[<p>k-Nearest Neighbors (kNN) is a fundamental supervised machine learning algorithm. As a lazy 
learner, the kNN kernel performs all computation during inference rather than requiring an explicit training phase.</p>

<p>This work shows that, when optimized systematically across mathematical, software, and hardware 
dimensions, kNN execution speed can be pushed closer to hardware limits, without modifying the underlying 
algorithm or target CPU.</p>

<p>By following a structured optimization path, the kNN Classifier and Regressor implementations achieve
up to a <code class="language-plaintext highlighter-rouge">250x</code> speedup over the naive baseline.</p>

<h2 id="a-structured-path-to-performance">A Structured Path to Performance</h2>

<p>The optimization followed a clear and layered progression:</p>
<ol>
  <li>Naive implementation of the mathematical model</li>
  <li>Mathematical optimizations of the same model</li>
  <li>Software-centric optimizations</li>
  <li>Hardware-centric optimizations</li>
</ol>

<p>Each layer presented its own bottlenecks and opportunities.</p>

<h2 id="mathematical-optimizations">Mathematical optimizations</h2>

<h3 id="efficient-selection-of-neighbors">Efficient Selection of Neighbors</h3>
<p>A significant portion of inference time is spent searching for the $k$ nearest neighbors. While a 
naive approach sorts or uses a <code class="language-plaintext highlighter-rouge">std::priority_queue</code> to retrieve the top elements, a more computationally efficient 
alternative uses <code class="language-plaintext highlighter-rouge">std::nth_element</code> to partition the vector in $O(N)$ average time.</p>
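<p>A small sketch of the selection step is shown below. The function name and the choice to return indices rather than distances are assumptions made for this example; only the use of <code class="language-plaintext highlighter-rouge">std::nth_element</code> comes from the text above.</p>

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: return the indices of the k smallest distances
// (in no particular order among themselves).
std::vector<std::size_t> k_nearest_indices(const std::vector<float>& distances, std::size_t k) {
    std::vector<std::size_t> idx(distances.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    k = std::min(k, idx.size());
    if (k == 0) return {};

    // Partition so that the k smallest distances occupy idx[0..k); O(N) on average.
    std::nth_element(idx.begin(), idx.begin() + (k - 1), idx.end(),
                     [&distances](std::size_t a, std::size_t b) {
                         return distances[a] < distances[b];
                     });
    idx.resize(k);
    return idx;
}
```

<p>The neighbors come back unsorted, which is sufficient for majority voting or averaging since neither depends on the order of the $k$ neighbors.</p>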

<h2 id="software-centric-optimizations">Software-centric optimizations</h2>

<h3 id="efficient-data-layout">Efficient Data Layout</h3>
<ul>
  <li><strong>Contiguous Row-Major Representation:</strong> The 2D feature matrix, originally represented as nested vectors (<code class="language-plaintext highlighter-rouge">std::vector&lt;std::vector&lt;float&gt;&gt;</code>), was flattened into a contiguous 1D <code class="language-plaintext highlighter-rouge">std::vector&lt;float&gt;</code> using the mapping:
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>X[row * num_cols + col]
</code></pre></div>    </div>
    <p>This flat layout offers:</p>
    <ul>
      <li>Reduced cache misses due to spatial locality.</li>
      <li>Predictable strided memory access.</li>
      <li>Enablement of compiler auto-vectorization (SIMD) on the contiguous memory block.</li>
    </ul>
  </li>
</ul>
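<p>The flattening and the index mapping can be sketched as follows. Function names are hypothetical and the code assumes a rectangular matrix (every row has the same length); it is an illustration rather than Glacier’s actual code.</p>

```cpp
#include <cstddef>
#include <vector>

// Copy a nested row-of-rows matrix into one contiguous row-major buffer.
std::vector<float> flatten(const std::vector<std::vector<float>>& nested) {
    const std::size_t num_rows = nested.size();
    const std::size_t num_cols = num_rows ? nested.front().size() : 0;
    std::vector<float> flat;
    flat.reserve(num_rows * num_cols);
    for (const auto& row : nested) flat.insert(flat.end(), row.begin(), row.end());
    return flat;
}

// Squared Euclidean distance between a query and one row of the flat matrix.
float squared_distance(const std::vector<float>& X, std::size_t num_cols,
                       std::size_t row, const std::vector<float>& query) {
    const float* r = X.data() + row * num_cols;  // start of the row: X[row * num_cols + 0]
    float acc = 0.0f;
    for (std::size_t col = 0; col < num_cols; ++col) {
        const float d = r[col] - query[col];
        acc += d * d;
    }
    return acc;
}
```

<p>Because each row is a contiguous run of floats, the inner loop walks memory sequentially, which is the access pattern caches and vectorizers handle best.</p>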

<h2 id="hardware-centric-optimizations">Hardware-centric optimizations</h2>

<h3 id="parallelization-using-openmp">Parallelization using OpenMP</h3>
<ul>
  <li>Multithreading enables concurrent execution across CPU cores, reducing overall execution time despite minor synchronization overhead. Using OpenMP, this parallelization is enabled at compile time with the <code class="language-plaintext highlighter-rouge">-fopenmp</code> flag.</li>
  <li>On an AMD Ryzen 5 6600H CPU (6 cores, 12 threads), allocating half of the available hardware threads (6 threads, one per physical core) yielded the highest efficiency by avoiding SMT resource sharing and thread contention.</li>
</ul>
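<p>A hedged sketch of what the parallel distance loop might look like is given below. The function name and the static schedule are assumptions made for illustration. When the code is built without <code class="language-plaintext highlighter-rouge">-fopenmp</code>, the pragma is ignored and the loop simply runs serially.</p>

```cpp
#include <cstddef>
#include <vector>

// Squared distances from `query` to every row of a flat row-major matrix X.
// Rows are independent, so the outer loop can be shared among threads.
std::vector<float> all_squared_distances(const std::vector<float>& X, std::size_t num_cols,
                                         const std::vector<float>& query) {
    const long num_rows = static_cast<long>(X.size() / num_cols);
    std::vector<float> out(static_cast<std::size_t>(num_rows));

    #pragma omp parallel for schedule(static)
    for (long row = 0; row < num_rows; ++row) {
        const float* r = X.data() + static_cast<std::size_t>(row) * num_cols;
        float acc = 0.0f;
        for (std::size_t col = 0; col < num_cols; ++col) {
            const float d = r[col] - query[col];
            acc += d * d;
        }
        out[static_cast<std::size_t>(row)] = acc;  // each iteration writes its own slot
    }
    return out;
}
```

<p>The thread count can be controlled without recompiling through the <code class="language-plaintext highlighter-rouge">OMP_NUM_THREADS</code> environment variable.</p>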

<h3 id="simd-vectorization">SIMD Vectorization</h3>
<ul>
  <li>Distance computation relies heavily on sum reductions. Compiling with <code class="language-plaintext highlighter-rouge">-march=native</code> allows the compiler to leverage SIMD vector extensions (such as AVX2 or AVX-512) for automated reduction vectorization.</li>
</ul>

<h3 id="compiler-flag-optimizations">Compiler Flag Optimizations</h3>
<ul>
  <li>Switching from <code class="language-plaintext highlighter-rouge">-O0</code> to <code class="language-plaintext highlighter-rouge">-O3</code> enables aggressive compiler optimizations.</li>
  <li><code class="language-plaintext highlighter-rouge">-ffast-math</code> permits additional optimizations (such as reassociating floating-point operations) that speed up computation at the expense of strict IEEE 754 compliance, so results may differ slightly from those of a strictly conforming build.</li>
</ul>
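<p>The reason reassociation needs an explicit opt-in is that floating-point addition is not associative, so regrouping a sum (which vectorizing a reduction effectively does) can change the result. The two hypothetical functions below add the same three values in different groupings; the behavior described afterwards assumes a build without <code class="language-plaintext highlighter-rouge">-ffast-math</code>.</p>

```cpp
// Sum three floats in two different groupings.
float sum_left(float a, float b, float c)  { return (a + b) + c; }  // (a + b) first
float sum_right(float a, float b, float c) { return a + (b + c); }  // (b + c) first
```

<p>For the inputs <code class="language-plaintext highlighter-rouge">1e8f, -1e8f, 1.0f</code>, the left grouping yields 1 while the right grouping yields 0, because adding 1 to a single-precision value of magnitude $10^8$ is lost to rounding.</p>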

<h3 id="profiling">Profiling</h3>
<ul>
  <li>The <code class="language-plaintext highlighter-rouge">perf</code> utility was used to profile the executable and identify hot spots in the execution path.</li>
</ul>

<h2 id="results">Results</h2>
<p><img src="/assets/blog2/benchmarking_result_knnc.jpeg" alt="KNNC Results" /></p>
<ul>
  <li>Datasets ranged in size from $500 \times 10$ to $140{,}000 \times 10$.</li>
  <li>The optimized implementation achieved a $100\times$ to $250\times$ speedup compared to unoptimized baseline builds.</li>
  <li>The optimized implementation remains $4\times$ to $36\times$ slower than scikit-learn’s implementation under identical hyperparameters, demonstrating the sophistication of production-grade libraries.</li>
  <li>The optimization infrastructure introduced additional code complexity and increased compile times.</li>
</ul>

<h3 id="benchmarking-setup">Benchmarking Setup</h3>
<ul>
  <li><strong>Hardware:</strong> AMD Ryzen 5 6600H CPU, 8GB RAM.</li>
  <li><strong>Environment:</strong> All user-launched background processes were closed, and the machine was set to performance power mode on AC power.</li>
</ul>

<h2 id="key-learnings">Key Learnings</h2>

<ul>
  <li>A clear distinction exists between a mathematical model on paper, a direct code translation, and a hardware-aware, optimized implementation.</li>
  <li>Understanding operating system scheduling and CPU cache hierarchies is critical for systems programming.</li>
  <li>Profiling tools (like <code class="language-plaintext highlighter-rouge">perf</code>) and integrated environments are essential for identifying micro-architectural bottlenecks.</li>
  <li>High-level libraries and wrappers abstract away significant underlying systems-engineering complexity.</li>
</ul>]]></content><author><name>Skandan C Yalagach</name><email>skandanyalagach@gmail.com</email><uri>https://skandanyal.github.io</uri></author><category term="kNN" /><category term="Hardware optimization" /><category term="Software optimizations" /><summary type="html"><![CDATA[Implemented optimizations to kNN Regressor and Classifier models, observed 100x inference speed-ups compared to naive version.]]></summary></entry><entry><title type="html">Building my own C++20 Numerical Algorithms library</title><link href="https://skandanyal.github.io/from_math_to_machines/gb1/" rel="alternate" type="text/html" title="Building my own C++20 Numerical Algorithms library" /><published>2025-12-15T00:00:00+00:00</published><updated>2025-12-15T00:00:00+00:00</updated><id>https://skandanyal.github.io/from_math_to_machines/gb1</id><content type="html" xml:base="https://skandanyal.github.io/from_math_to_machines/gb1/"><![CDATA[<p>Glacier.ML is a header-only numerical algorithms library implemented in C++20, utilizing Eigen for linear algebra operations where appropriate.</p>

<p><img src="/assets/blog1/Pasted_image_1.png" alt="Glacier.ML" />
The project originated as a practical follow-up to coursework in multivariate statistical modeling, specifically focusing on linear regression and its evaluation metrics. To translate theory into code, I used Stanford Online’s Statistical Learning lectures as a mathematical reference, building the logic without higher-level machine learning frameworks.</p>

<p>An initial prototype was reviewed by my artificial intelligence professor, whose feedback motivated me to expand the experiment into a more comprehensive library.</p>

<p>At present, Glacier.ML implements three stable models:</p>
<ul>
  <li>Simple Linear Regression</li>
  <li>Multiple Linear Regression</li>
  <li>Binary Logistic Regression</li>
</ul>

<p>The logistic regression implementation has been trained, tested, and validated on two real-world datasets:</p>
<ul>
  <li>Pima Indians Diabetes Dataset (768 × 9)</li>
  <li>Wisconsin Diagnostic Breast Cancer Dataset (569 × 32)</li>
</ul>

<p><img src="/assets/blog1/Pasted_image_2.png" alt="Confusion matrix 1" />
<img src="/assets/blog1/Pasted_image_3.png" alt="Confusion matrix 2" />
Confusion matrices for the Pima Indians Diabetes Dataset and the Wisconsin Diagnostic Breast Cancer Dataset, respectively.
</p>

<p><img src="/assets/blog1/Pasted_image_4.png" alt="Comparison" />
Benchmarking training time of Glacier’s Logistic Regression against Scikit-learn’s Logistic Regression</p>

<p>Evaluation metrics and training times were benchmarked against scikit-learn’s logistic regression. Despite lacking explicit optimization and parallelism, Glacier.ML achieved comparable accuracy and training speed in these tests.</p>

<p>This implementation highlighted low-level numerical stability challenges, such as floating-point underflow, which are typically managed behind the scenes in higher-level ML libraries.</p>

<p>Below is a minimal example demonstrating dataset ingestion, training, prediction, and evaluation using Glacier.ML’s binary logistic regression pipeline.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">"Glacier/Models/MLmodel.hpp"</span><span class="cp">
#include</span> <span class="cpf">"Glacier/Utils/utilities.hpp"</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="kt">float</span><span class="o">&gt;&gt;</span> <span class="n">X</span><span class="p">,</span> <span class="n">X_t</span><span class="p">;</span>
    <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">std</span><span class="o">::</span><span class="n">string</span><span class="o">&gt;</span> <span class="n">y</span><span class="p">,</span> <span class="n">y_t</span><span class="p">;</span>

    <span class="n">Glacier</span><span class="o">::</span><span class="n">Utils</span><span class="o">::</span><span class="n">read_csv_c</span><span class="p">(</span><span class="s">"../Datasets/training_dataset.csv"</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="nb">true</span><span class="p">);</span>
    <span class="n">Glacier</span><span class="o">::</span><span class="n">Utils</span><span class="o">::</span><span class="n">read_csv_c</span><span class="p">(</span><span class="s">"../Datasets/testing_dataset.csv"</span><span class="p">,</span> <span class="n">X_t</span><span class="p">,</span> <span class="n">y_t</span><span class="p">,</span> <span class="nb">true</span><span class="p">);</span>

    <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="kt">float</span><span class="o">&gt;&gt;</span> <span class="n">X_p</span> <span class="o">=</span> <span class="p">{</span>
        <span class="p">{</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span> <span class="p">....</span> <span class="n">n</span><span class="p">}</span>
    <span class="p">};</span>
    <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">std</span><span class="o">::</span><span class="n">string</span><span class="o">&gt;</span> <span class="n">y_p</span> <span class="o">=</span> <span class="p">{</span><span class="s">"label_1"</span><span class="p">};</span>

    <span class="n">Glacier</span><span class="o">::</span><span class="n">Models</span><span class="o">::</span><span class="n">MLmodel</span> <span class="n">md</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">);</span>

    <span class="kt">float</span> <span class="n">hp1</span> <span class="o">=</span> <span class="mf">1.0</span><span class="n">f</span><span class="p">;</span>
    <span class="n">md</span><span class="p">.</span><span class="n">train</span><span class="p">(</span><span class="n">hp1</span><span class="p">);</span>

    <span class="k">auto</span> <span class="n">md_pred</span> <span class="o">=</span> <span class="n">md</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_p</span><span class="p">);</span>
    <span class="n">md</span><span class="p">.</span><span class="n">analyze_2_targets</span><span class="p">(</span><span class="n">X_t</span><span class="p">,</span> <span class="n">y_t</span><span class="p">);</span>

    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>This example illustrates Glacier.ML’s core design goal: providing a minimal, explicit training and evaluation pipeline without hidden abstractions. Hyperparameters, data ingestion, and evaluation remain directly under user control.</p>

<p>The source code is available on <a href="https://github.com/skandanyal/Glacier.ML">GitHub</a>.</p>]]></content><author><name>Skandan C Yalagach</name><email>skandanyalagach@gmail.com</email><uri>https://skandanyal.github.io</uri></author><category term="Glacier.ML" /><category term="Systems-ML" /><category term="Machine learning library" /><category term="Linear Regression" /><category term="Logistic Regression" /><summary type="html"><![CDATA[Introducing Glacier.ML - a C++ based Supervised ML library, written from scratch to understand theoretical ML.]]></summary></entry></feed>