The source paper reports competitive results on three tabular datasets and a much weaker DIGITS result. A reproduction should pin the source revision, data splits, preprocessing, seeds, initial energy, structural action costs, phase windows, and all hyperparameters.
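One way to make that list concrete is a frozen run manifest stored next to every result. The sketch below is only an illustration: `RunManifest`, its field names, and the `fingerprint` helper are hypothetical and do not come from the source paper, and the values a reproduction records should be whatever the pinned source revision actually uses.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class RunManifest:
    """Hypothetical record of everything a run depends on (names are assumptions)."""

    source_revision: str   # e.g. a commit hash of the code being reproduced
    dataset: str
    split_seed: int        # seed that generated the train/validation/test split
    preprocessing: str     # identifier of the preprocessing recipe
    seed: int              # seed for model initialisation and sampling
    initial_energy: float
    action_costs: dict = field(default_factory=dict)     # structural action -> cost
    phase_windows: dict = field(default_factory=dict)    # phase name -> [start, end]
    hyperparameters: dict = field(default_factory=dict)  # everything else

    def fingerprint(self) -> str:
        # Serialise with sorted keys so identical settings always hash identically.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
```

Saving the fingerprint alongside each metric makes it easy to notice when two runs that were meant to be identical differ in some pinned setting.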
Reproduce more than final accuracy
Retain hypothesis lineages, energy traces, candidate-action scores, transition rate, noop streaks, freeze step, and ablation runs. Final accuracy alone cannot show whether the reported dynamics actually occurred during a run.
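A per-step trace is one possible way to keep these quantities. The following is a minimal sketch under stated assumptions: `StepRecord`, `summarize`, and `to_jsonl` are hypothetical names, the action label `"noop"` is assumed, and "transition rate" is taken here to mean the fraction of steps whose chosen action is not a noop, which may differ from the source paper's definition.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass
class StepRecord:
    """One logged step; all field names are illustrative assumptions."""

    step: int
    hypothesis_id: str
    parent_id: Optional[str]          # lineage link; None for the root hypothesis
    energy: float
    action_scores: Dict[str, float]   # score of every candidate action
    chosen_action: str


def summarize(records: List[StepRecord], noop: str = "noop") -> dict:
    """Compute an assumed transition rate and the longest run of noop steps."""
    longest = current = transitions = 0
    for record in records:
        if record.chosen_action == noop:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
            transitions += 1
    rate = transitions / len(records) if records else 0.0
    return {"transition_rate": rate, "longest_noop_streak": longest}


def to_jsonl(records: List[StepRecord]) -> str:
    """Serialise records as JSON Lines, one step per line."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)
```

Keeping the raw records, and not only the summary, lets a later reader recompute these statistics under a different definition.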
Stress the listed limitations
Tests should vary feature dimension, class count, parameter correlation, initial energy, score calibration, and action vocabulary. Independent negative results are useful evidence.
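Such a stress test could be organised as an explicit grid over the six factors. In the sketch below the axis names follow the list above, but every value is a placeholder chosen for illustration and is not taken from the source paper; `SWEEP` and `expand` are hypothetical names.

```python
from itertools import product

# Placeholder values only; a real study should choose ranges that bracket
# the settings used in the original experiments.
SWEEP = {
    "feature_dim": [8, 64, 512],
    "class_count": [2, 10],
    "parameter_correlation": [0.0, 0.5, 0.9],
    "initial_energy": [0.5, 1.0, 2.0],
    "score_calibration": ["raw", "temperature_scaled"],
    "action_vocabulary": ["full", "no_structural_actions"],
}


def expand(grid):
    """Yield one configuration dict per point of the full factorial grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


configs = list(expand(SWEEP))
```

A full factorial grid grows quickly, so varying one factor at a time around a baseline is a cheaper alternative when compute is limited. Either way, recording the configurations that fail matters as much as recording those that succeed.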