Reading the notation. The model only ever predicts whole 6-mers, so it never directly
reports a probability for a single base. p(1)(A) is what we want:
the probability that position 1 of the next 6-mer is A. We recover it by
marginalizing: we sum the probability of every 6-mer that carries A in slot 1
while the other five positions range over all bases (the shuffling boxes, written
Σ*). With four bases, that sum has 4⁵ = 1024 terms.
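
The marginalization above can be sketched in a few lines of Python. This is a minimal illustration, not the model's actual code: the four-letter alphabet ACGT, the dictionary representation of the 6-mer distribution, and the names `KMERS` and `position_marginal` are all assumptions made for the example.

```python
from itertools import product

BASES = "ACGT"  # assumed four-letter alphabet
K = 6

# Every possible 6-mer (4**6 = 4096 of them), in a fixed order.
KMERS = ["".join(t) for t in product(BASES, repeat=K)]


def position_marginal(kmer_probs, slot, base):
    """Sum the probability of every 6-mer that has `base` at `slot`.

    `slot` is 1-indexed to match the text; `kmer_probs` is a hypothetical
    dict mapping each 6-mer string to its predicted probability.
    """
    return sum(p for kmer, p in kmer_probs.items() if kmer[slot - 1] == base)


# Toy distribution (hypothetical): uniform over all 6-mers.
uniform = {kmer: 1.0 / len(KMERS) for kmer in KMERS}

# 1024 of the 4096 6-mers start with A, so the marginal is one quarter.
p1_A = position_marginal(uniform, slot=1, base="A")
print(p1_A)  # 0.25
```

For any slot, the four marginals (one per base) add up to 1, because every 6-mer is counted exactly once.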
Computing the loss. Doing this at each slot gives six per-position probabilities, one
for each base of the ground-truth 6-mer. Their product, which treats the six positions as
independent, is the model's probability of getting every position right; taking −log turns
it into a loss, and dividing by 6 averages it per base. That is the loss = − ¹⁄₆ log(…)
expression above. Because the credit is split across positions, a near-miss like
TATATT still earns credit for the five bases it got right, unlike plain 6-mer
cross-entropy, which penalizes every wrong 6-mer equally, however close it is.
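
To make the contrast with plain 6-mer cross-entropy concrete, here is a hedged sketch of both losses on toy distributions. Everything specific in it is an assumption for illustration: the ground-truth 6-mer `TATATA` (the text names only the near-miss), the far-miss `CGCGCG`, the 0.9 peak mass, and the helper names `per_base_loss`, `kmer_cross_entropy`, and `peaked`.

```python
import math
from itertools import product

BASES = "ACGT"  # assumed four-letter alphabet
K = 6
KMERS = ["".join(t) for t in product(BASES, repeat=K)]


def per_base_loss(kmer_probs, target):
    """-(1/6) * log of the product of the six per-position marginals."""
    total_log = 0.0
    for slot, base in enumerate(target):  # slot is 0-indexed here
        marginal = sum(p for kmer, p in kmer_probs.items() if kmer[slot] == base)
        total_log += math.log(marginal)
    return -total_log / K


def kmer_cross_entropy(kmer_probs, target):
    """Plain cross-entropy: only the exact target 6-mer earns credit."""
    return -math.log(kmer_probs[target])


def peaked(peak, mass=0.9):
    """Toy distribution: `mass` on one 6-mer, the rest spread evenly."""
    rest = (1.0 - mass) / (len(KMERS) - 1)
    return {kmer: (mass if kmer == peak else rest) for kmer in KMERS}


target = "TATATA"          # hypothetical ground truth
near = peaked("TATATT")    # wrong only in the last position
far = peaked("CGCGCG")     # wrong in every position

for name, dist in [("near-miss", near), ("far-miss", far)]:
    print(name,
          round(per_base_loss(dist, target), 3),
          round(kmer_cross_entropy(dist, target), 3))
```

Under these assumptions both toy models receive the same 6-mer cross-entropy, since each gives the exact target the same small leftover probability, while the per-base loss is clearly lower for the near-miss.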