In Cube and Conquer, the look-ahead heuristic (in all its variations) seems to be the most popular heuristic for finding splitting variables; it is certainly the most discussed in books. It is also used, for example, in Marijn's papers on the Pythagorean Triples (2016) and Schur Number 5 (2017).
So I assumed that look-ahead is just the standard way to go in modern CnC.
But apparently, look-ahead has been somewhat abandoned in recent years, at least for the difficult CnC problems in Marijn's group. Since 2018, they have switched back to manual variable selection for some of their papers. This means distilling, in weeks of delicate work, the 10–20 most important variables of a given instance by hand, getting to know them one by one.
One reason I can see now is that these combinatorial problems are so hard that splitting alone doesn't cut it anyway; they also require some highly sophisticated encoding optimizations. And these optimizations have to be developed by hand, which takes weeks (or even months) of careful study of the individual problem, at which point good splitting variables might just emerge naturally, and much more robustly than any look-ahead procedure could find them in a few hours.
This years’ SAT conference now offered a new automated approach for splitting, by Zachary Battleman et al.: Run a CDCL solver on the given instance for a few seconds, say until ca. 100.000 proof steps are logged, then count the number of appearances of each variable in that partial proof. Choose the most-occuring variable as the next splitting candidate. (See also this 2-page summary by Zach). This splitting procedure seems to work quite well!

The plot shows how often each variable occurs in the growing proof. Colored variables are those the research group considers "good" splitting candidates, based on earlier experience, and there does seem to be a strong correlation.
By just picking the top 15 variables by count, odds are high that we get 10 or more "good" variables. Of these, there exist only 10–100 per instance, say 50 on average, and identifying them previously required intensive manual work. Here, in a few seconds, roughly 10 of the 15 picks are good ones, compared to a baseline chance of 50/100,000 when picking variables at random. That's a really strong signal.
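Spelling out the comparison with the rough numbers above (estimates, not measured values):

```python
# Back-of-envelope comparison, using the rough numbers from above.
good_in_top_15 = 10        # "good" variables expected among the top 15
good_per_instance = 50     # assumed average number of good variables
pool_size = 100_000        # size of the pool a random pick draws from

hit_rate_counting = good_in_top_15 / 15          # about 0.67
hit_rate_random = good_per_instance / pool_size  # 0.0005

print(f"counting heuristic: {hit_rate_counting:.0%} hits per pick")
print(f"random baseline:    {hit_rate_random:.2%} hits per pick")
print(f"improvement factor: about {hit_rate_counting / hit_rate_random:.0f}x")
```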
I am curious what will come out of this automated counting. It is certainly in the right place to be examined further, being developed squarely within Marijn's group.
I also wonder whether the CDCL-world can learn something from this strong counting signal.