A Technical Gibberish Post: Lorem Gradients and Ipsum Solvers

Lorem ipsum dolor sit amet, consectetur adipiscing elit. The point of this post is not to say anything — it is to verify that the blog’s math pipeline (kramdown → MathJax 2.7 with TeX-MML-AM_CHTML) renders everything a technical post is likely to throw at it: inline math, display math, aligned derivations, numbered equations, symbols, code, and collapsible “show-the-derivation” toggles.

1. The ipsum objective

The clean, reader-friendly statement of the “problem” is just one equation:

\[\theta^\star = \arg\min_{\theta \in \Theta}\; \mathbb{E}_{x \sim p}\!\Bigl[\,\ell\bigl(f_\theta(x),\, y(x)\bigr)\Bigr] + \lambda\, \mathcal{R}(\theta). \label{eq:objective}\]

Here \(\theta \in \mathbb{R}^d\) are the parameters, \(\mathcal{R}\) is a (fake) regularizer, and \(\lambda \ge 0\) is a (fake) strength. That is all a casual reader needs. A careful reader can expand the derivation below.

Derive the stochastic gradient of \(\mathcal{L}(\theta)\)

Define the per-sample loss \(L_\theta(x) := \ell(f_\theta(x), y(x))\) and the expected loss \(\mathcal{L}(\theta) := \mathbb{E}_{x \sim p}\bigl[L_\theta(x)\bigr]\). Differentiating under the integral sign (which we pretend is justified here by the dominated lorem theorem):

\[\begin{aligned} \nabla_\theta \mathcal{L}(\theta) &= \nabla_\theta \int L_\theta(x)\, p(x)\, \mathrm{d}x \\ &= \int \nabla_\theta L_\theta(x)\, p(x)\, \mathrm{d}x \\ &= \mathbb{E}_{x \sim p}\!\bigl[\nabla_\theta L_\theta(x)\bigr]. \end{aligned}\]

Given an i.i.d. mini-batch \(\{x_i\}_{i=1}^{B}\), the unbiased Monte Carlo estimator is

\[\hat g(\theta) \;=\; \frac{1}{B} \sum_{i=1}^{B} \nabla_\theta L_\theta(x_i), \qquad \mathbb{E}[\hat g] = \nabla_\theta \mathcal{L}(\theta).\]

Adding the regularizer gives the update rule

\[\theta_{t+1} \;=\; \theta_t - \eta_t \bigl(\hat g(\theta_t) + \lambda\, \nabla \mathcal{R}(\theta_t)\bigr),\]

which is what the fake algorithm in the next section implements. \(\blacksquare\)
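
The unbiasedness of \(\hat g\) is easy to sanity-check numerically. A minimal sketch with a made-up quadratic per-sample loss \(L_\theta(x) = \tfrac{1}{2}(\theta - x)^2\), so that \(\nabla_\theta L_\theta(x) = \theta - x\) and the exact gradient is \(\theta - \mathbb{E}[x]\) (all numbers here are illustrative, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: L_theta(x) = 0.5 * (theta - x)^2 with x ~ N(mu, 1),
# so grad_theta L_theta(x) = theta - x and the exact gradient is theta - mu.
theta, mu, B = 1.5, 0.3, 4
true_grad = theta - mu

# Average many independent mini-batch estimates g_hat = (1/B) * sum_i (theta - x_i).
estimates = [np.mean(theta - rng.normal(mu, 1.0, size=B)) for _ in range(20_000)]

bias = np.mean(estimates) - true_grad
print(f"empirical bias = {bias:.4f}")   # should be close to 0
```

The empirical mean of \(\hat g\) matches \(\theta - \mu\) up to Monte Carlo noise of order \(1/\sqrt{\text{trials}}\), which is all the unbiasedness claim says.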

2. Inline math in prose

Inline math is written the same way, just without blank lines around it. The condition number \(\kappa(H) = \lambda_\max(H)/\lambda_\min(H)\) controls step-size sensitivity, and the Lipschitz constant \(L_{\nabla \mathcal{L}}\) bounds the safe learning rate by \(\eta \le 1/L_{\nabla \mathcal{L}}\) (well-known in gibberish optimization).
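
For a quadratic, both quantities come straight from the Hessian's eigenvalues, since \(L_{\nabla \mathcal{L}} = \lambda_\max(H)\). A sketch with a hypothetical \(2 \times 2\) Hessian (the matrix is invented for illustration):

```python
import numpy as np

# Hypothetical symmetric Hessian of a toy quadratic loss.
H = np.array([[4.0, 1.0],
              [1.0, 2.0]])

eigvals = np.linalg.eigvalsh(H)        # ascending order, since H is symmetric
lam_min, lam_max = eigvals[0], eigvals[-1]

kappa = lam_max / lam_min              # condition number kappa(H)
eta_safe = 1.0 / lam_max               # step-size bound eta <= 1/L with L = lam_max

print(f"kappa(H) = {kappa:.3f}, safe eta <= {eta_safe:.3f}")
```

`eigvalsh` is used rather than `eig` because the Hessian is symmetric, which guarantees real eigenvalues and returns them sorted.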

3. A gibberish algorithm

import numpy as np

def lorem_solver(x, steps=100, lr=1e-2, reg=1e-3):
    """SGD on the ipsum loss. Does nothing useful; for style only."""
    theta = np.zeros_like(x)
    for t in range(steps):
        g = np.sin(theta) - 0.1 * x          # "per-sample gradient"
        theta = theta - lr * (g + reg * theta)
        if t % 10 == 0:
            print(f"step {t:3d}  ||theta||={np.linalg.norm(theta):.4f}")
    return theta

A shell snippet, because the theme styles those too:

$ bundle exec jekyll serve --drafts
Configuration file: _config.yml
            Source: /khegazy.github.io
       Destination: /khegazy.github.io/_site
      Generating... done.

4. A multi-step “proof” behind a toggle

The following claim is false, but its presentation exercises the toggle component with a longer body:

Claim (spurious). For any ipsum-smooth \(\mathcal{L}\) and any \(\theta_0\), SGD with constant step size converges to a global minimum.

Fake proof (expand at your own risk)

Step 1 — descent lemma (pretend). By Taylor expansion around \(\theta_t\),

\[\mathcal{L}(\theta_{t+1}) \le \mathcal{L}(\theta_t) + \langle \nabla \mathcal{L}(\theta_t),\, \theta_{t+1} - \theta_t \rangle + \tfrac{L}{2}\,\lVert \theta_{t+1} - \theta_t \rVert^2.\]

Step 2 — substitute the update. With \(\theta_{t+1} = \theta_t - \eta \hat g_t\) and \(\mathbb{E}[\hat g_t] = \nabla \mathcal{L}(\theta_t)\),

\[\mathbb{E}\bigl[\mathcal{L}(\theta_{t+1})\bigr] \le \mathcal{L}(\theta_t) - \eta\,\lVert \nabla \mathcal{L}(\theta_t) \rVert^2 + \tfrac{\eta^2 L}{2}\,\mathbb{E}\bigl[\lVert \hat g_t \rVert^2\bigr].\]
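
Rearranging, and bounding \(\mathbb{E}\bigl[\lVert \hat g_t \rVert^2\bigr] \le \sigma^2\) (an assumption the next step uses implicitly),

\[\eta\,\mathbb{E}\bigl[\lVert \nabla \mathcal{L}(\theta_t) \rVert^2\bigr] \;\le\; \mathcal{L}(\theta_t) - \mathbb{E}\bigl[\mathcal{L}(\theta_{t+1})\bigr] + \tfrac{\eta^2 L}{2}\,\sigma^2.\]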

Step 3 — telescoping. Summing over \(t = 0, \ldots, T-1\):

\[\sum_{t=0}^{T-1} \eta\, \mathbb{E}\bigl[\lVert \nabla \mathcal{L}(\theta_t) \rVert^2\bigr] \le \mathcal{L}(\theta_0) - \mathcal{L}^\star + \tfrac{\eta^2 L}{2} \sum_{t=0}^{T-1} \sigma^2.\]

Step 4 — “therefore”. We wave hands vigorously and declare the claim proven. This step is load-bearing. \(\blacksquare\)

(The actual convergence result is only to a stationary point in expectation, and only with diminishing step sizes for non-convex \(\mathcal{L}\) — the proof above is intentionally broken.)
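
For contrast, the standard non-convex guarantee under a bounded second-moment assumption \(\mathbb{E}\bigl[\lVert \hat g_t \rVert^2\bigr] \le \lVert \nabla \mathcal{L}(\theta_t) \rVert^2 + \sigma^2\) and \(\eta \le 1/L\) only bounds the best gradient norm seen so far:

\[\min_{0 \le t < T} \mathbb{E}\bigl[\lVert \nabla \mathcal{L}(\theta_t) \rVert^2\bigr] \;\le\; \frac{2\bigl(\mathcal{L}(\theta_0) - \mathcal{L}^\star\bigr)}{\eta\, T} + \eta\, L\, \sigma^2,\]

which certifies approach to a stationary point (up to a noise floor set by \(\eta\)), not a global minimum.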

5. A small table

| Method      | Lorem \(\downarrow\) | Ipsum \(\uparrow\) | Dolor (s) |
|:------------|---------------------:|-------------------:|----------:|
| Baseline    | 12.34                | 0.712              | 42.0      |
| Ours (tiny) | 9.81                 | 0.803              | 28.5      |
| Ours (big)  | 6.02                 | 0.871              | 61.7      |

Numbers above are fabricated. End of placeholder.