Log in

26 November 2009 @ 01:56 am
What If ...? #4 - Sudoku Retesting  
About four weeks ago, a former sudoku champion rang a bell that cannot be unrung by claiming a fellow finalist was a cheater. While a lot of circumstantial evidence supported this case, no smoking gun was likely to be found from looking at video and photography from the event itself. Sudoku-related curiosities such as "unreasonable sticking points" or "missing pencilmarks" were unlikely to satisfy all parties necessary to disqualify the competitor. However, the tournament organizers came up with a process that could clarify matters: the contestant would be offered a second chance to solve more sudoku puzzles under closer watch to demonstrate his proficiency and to deflect the allegations of cheating. This leads to this week's topic:

What if you had to conduct a retest for sudoku? What kind of standardized puzzles or times would suffice to conclusively prove either the ability or inability to finish three puzzles in 14 minutes? How many puzzles would be needed and what would these be? Can sudoku be made "retestable" to indeed prove Eugene is detestable?

These questions necessarily lead to a discussion of mean and variance and nuances of the solving process; a fitting answer requires a careful analysis of how an actual human solver performs from puzzle to puzzle in a controlled setting. Fortunately, I know an eccentric scientist with an advanced degree in "Number Placement" who has a relevant manuscript to share with you:

Effects of Transformations of Sudoku Puzzles on Solution Times of an Advanced Solver
A Case Study by Dr. Sudoku


We generated five isomorphic copies of five sudoku puzzles using digit reassignment, row/column/chute swapping, and rotation/reflection. A sixth puzzle with an unusual solving property was also transformed and added to the set for twenty-six test puzzles. Our subject, an advanced division contestant at the recent Sudoku National Championship, was able to identify the grid with the unusual solving property from the set of 26 puzzles, and also identified several commonalities within the remaining isomorphic puzzles. Considering his solution times, a 15-20% standard deviation in time was observed across the isomorphic versions of each puzzle. The fastest/slowest grid outliers were then resolved without further transformation. The subject's results suggest that the likely mean and standard deviation in solving time on an exact duplicate is very close to that of a transformed grid.

We also examined how transformations would affect computer ratings, specifically those of Scanraid Sudoku. Both number ordering and top-to-bottom, left-to-right biases were observed in the grading of isomorphic puzzles. In one extreme case, a 90 degree rotation of a puzzle led to a 50% increase in the rating for the puzzle. While the varied ratings may apply to the most esoteric human solvers, who persist with perfect top to bottom or 1 to 9 scanning, these ratings in general seem inconsistent with the expected human solving experience of sudoku.


Sudoku is a logic-based constraint-satisfaction puzzle, most typically presented on a 9x9 grid of squares that must be filled with numbers from 1-9 such that no number repeats in any row, column, or 3x3 bold region; other sizes and the use of symbols other than numbers in the cells are possible. The regions add a third constraint onto each cell which separates sudoku from Latin Square-based puzzles with solely row and column constraints. Despite some reports to the contrary [Crook, 2009], sudoku puzzles must have a single solution to be considered valid; valid sudoku with as few as 17 given numbers are known, although this has not yet been proven to be the lower bound.

Following years of research and individual experiences with making deductions in sudoku, a number of strategies have been formalized for their solution [see, for example, Scanraid Sudoku solver; Stuart (2005-9), also Stuart (2007)]. Many of the basic strategies fit into the categories of either positive inference or negative inference depending on whether the presence, or absence, of particular numbers is used in the placement of further digits. Different solvers may vary in their preferences for or abilities with positive versus negative inference which presents a challenge for the "retesting" of a competitor, as in the recent Varshavsky case [Philadelphia Inquirer, 2009], as a high variance in solving times might be encountered by a particular solver if presented a random sudoku to solve in under 5 minutes. Additionally, the breadth of a solution path can also influence the solving time, with a smaller variance of times expected for a puzzle with many places to identify digits by logical means compared to those with fewer such places. Given the puzzle by puzzle variation seen in sudoku, single puzzle finals (and playoffs in general) are not viewed by many members of the advanced solving community as an appropriate way to select the best overall sudoku solver [see motris.livejournal.com].

An individual sudoku puzzle can be transformed into a large number of isomorphic grids using number reordering, row/column/chute swapping, and rotations/reflections. For this reason, while 6,670,903,752,021,072,936,960 possible solutions have been reported for sudoku, only 5,472,730,538 are considered "essentially different". [Russel, Jarvis, 2005 and 2006] It is speculated, but has not yet been demonstrated, that isomorphic puzzles can lead to highly variable solution times due to various esoteric tendencies of the solver. A solver who approaches sudoku in a cyclic 1-9 search order may be affected by reassigning the values of the givens. Similarly, a solver who starts from the top to bottom or from the left to the right could be affected by row/column/chute swapping as well as rotations or diagonal mirroring. However, despite these nuances of the solution process, the essential steps along the solution path will be unchanged in these isomorphic grids. If one had to retest a competitor, isomorphic forms of the puzzle might be valid for such a test. If the competitor had, for example, memorized the solution of a particular puzzle (even in an extreme case where he has memorized an order of digits to enter, perhaps from receiving a copy of the puzzle before the test and needing to reproduce a "solution path" while under increased scrutiny), an isomorphic version of the same puzzle will be protected from such an attack as it will be impossible to memorize the approximately one trillion isomorphic grids that are possible for any individual puzzle. Further, if a solver claims he is better at, say, negative inference - excelling at naked singles but not hidden singles for example - and tries to use this to explain a much slower time for a particular puzzle, the use of an isomorphic grid containing an identical number of potential starting points for a particular kind of logic rule will negate any "variation of solving method" rebuttal for a poor retest. (Of course, according to Thomas Snyder [personal communication], the most fascinating element of the Varshavsky case was that after 4 weeks of being under suspicion, the competitor actually showed up for a retest without any practice to make it remotely possible he could deceive the invigilators, even on identical puzzles to the ones under suspicion! We are therefore considering the case of catching smarter cheaters, with an altered ratio of intelligence to hubris, in this study).

Here, we have tested a set of puzzles randomly ordered following transformations to determine how consistent or inconsistent the solving process is. We also added a single puzzle with a particularly unusual solving property into the mix - and alerted the subject to its presence - to see if the step would be encountered on his solving path and identifiable amidst many other puzzles.

Materials and Methods:

A perl script was written to output a specified number of isomorphic forms of an input classic 9x9 sudoku puzzle by performing number replacement, then row and horizontal chute permutations, then column and vertical chute permutations, and then possibly performing a rotation/reflection to reach the final altered state. 5 puzzles (3 of approximately 4 star difficulty, 2 of approximately 5 star difficulty, all appropriate to reward logical methods instead of bifurcation using the identification of pairs and triples as the hardest steps by Scanraid analysis) were selected for analysis and five versions of each were generated. A sixth puzzle with an unusual solving property, specifically an Order 5 Franciscan Rectangle that fits into a class of "uniqueness" techniques, was chosen and randomized as well. This advanced technique is not required to solve the puzzle, but is a potential shortcut an advanced solver might find and exploit. All puzzles had between 24 and 28 givens and the number of givens was not unique for any of the six puzzles.

The 26 puzzles (now labeled A-Z) were printed on 4.5" by 4.5" square grids, identical to the size used at the Sudoku National Championship, and then solved in batches of 5 (in one case 6) puzzles at a time with short 5 minute breaks. A recorded, pre-watched, episode of "Sportscenter" was playing in the background throughout the testing as a kind of white noise. The subject, selected for being a high-finishing participant in many sudoku championships, chose to solve the puzzles while seated in a comfy chair, writing against a notebook surface with .9 mm Pentel Twist-Erase III pencils. The subject timed himself on the solution of these puzzles, and the finished grids were checked for completeness of filling, but not checked for accuracy. From anecdotal evidence provided by the subject, about 95% of puzzles will be correctly solved in this manner so, in this study, approximately 1 puzzle may have an error such as a transposition.

In addition to the human testing, the puzzles were also scored in Scanraid Sudoku to determine how "computer rating" itself varies as a result of grid transformations.

The 26 puzzles are provided in Appendix B.


The subject's initial solve data (and paper notes) follow:

A - 3'22", B - 2'39", C - 3'56", D - 3'57" (franciscan-like), E - 3'41", F - 2'40", G - 3'3", H - 1'48", I - 2'12", J - 2'2", K - 2'27", L - 3'31" (Order 5 Franciscan), M - 3'18", N - 3'1", O - 1'52", P - 4'21" (franciscan-like), Q - 1'39", R - 1'53", S - 3'16", T - 3'17" (franciscan-like), U - 3'41" (franciscan-like), V - 3'7", W - 2'53", X - 2'6", Y - 3'35", Z - 2'1"

The subject completed the 26 puzzles in a range of times from as short as 99 seconds to as long as 261 seconds. The average solving time was approximately 2 minutes, 54 seconds, with a standard deviation of 46 seconds. The median solve time was approximately 3 minutes, 2 seconds. During solving, the subject never had to make an erasure or correction to his puzzles, which would lead to a higher than expected time for his solving process on that individual grid. If such an event had occurred, we would have suggested a future retesting of the same puzzle as "clean" solving times are of the most interest to us for comparison.

The subject also made notes on some of his puzzles. Given the expectation of a Franciscan puzzle of order 5, the subject related that "when I saw 4 similar pairs that could make such a situation, but didn't resolve properly, I noted 'franciscan-like' at the end of the solution process." For 4 solutions (D, P, T, U) this text was indeed written in the margins of that test paper. For a fifth puzzle, L, Order 5 Franciscan was written. Puzzle L is indeed the sudoku with an Order 5 Franciscan Rectangle. D, P, T, U (and M) are all isomorphic forms of another puzzle. It is unclear why the subject did not mark M as Franciscan-like but it is within a second of the fastest time over the trials for this puzzle meaning the critical steps may have been identified before the relevant pair property was noted in a period of being "stuck". It also occurred in the trial immediately after the Order 5 Franciscan grid (L) and could have been missed due to increased laxity after identifying the hidden sixth puzzle.

After the testing, the subject also noted to experimenters that "deja vu" feelings about the starting steps of puzzles would happen such as "I've chased this kind of digit by pointing pairs around the grid before." While we did not ask the subject to try to isolate each individual puzzle, the solver indicated that if he noted the kinds of starting steps, as well as sticking points, he could possibly have accomplished this assignment with some accuracy. Because of this alleged "memory" of a puzzle, we did consider whether the solver got "better" at a particular puzzle, even with isomorphism, as the trials went on.

Rearranging the 26 puzzles into the 6 unique types, and ordering by apparent difficulty, these are the subject's times in chronological order:

1 - HJKQZ - 1'48", 2'2", 2'27", 1'39", 2'1" - average time 1'59", deviation 18", range 1'39"-2'27"
2 - BGOVX - 2'39", 3'3", 1'52", 3'7", 2'6" - average time 2'33", deviation 33", range 1'52"-3'3"
3 - EFIRY - 3'41", 2'40", 2'12", 1'53", 3'35" - average time 2'48", deviation 48", range 1'53"-3'41"
4 - ACNSW - 3'22", 3'56", 3'1", 3'16", 2'53" - average time 3'18", deviation 24", range 2'53"-3'56"
5 - DMPTU - 3'57", 3'18", 4'21", 3'17", 3'41" - average time 3'43", deviation 27", range 3'17"-4'21"
6 - L - 3'41"

The tightest solution times exist for the puzzle with the fastest solution, but the next two tightest distributions are for the two puzzles that respectively take the longest. When sorted by average solve time, both the fastest time on a puzzle, and the slowest time on a puzzle (the "range") are also in the indicated 1-5 order. Considering the relative position of the fastest versus the slowest solve, for these three puzzles the fastest solutions were #4, 3, 4, 5, and 4 (average of 4.0) and the slowest were #3, 4, 1, 2, 3 (average of 2.6) which suggests a slight possibility of some learning throughout the process although in a few cases, as on the puzzle with the highest variance (3), a very fast fourth solution was followed by a much slower fifth solution (by 102 seconds) that was only slight ahead of the overall slowest solution so a matter of a few seconds would change a 1 to a 5 and affect this averaging. Another hypothesis that accounts for the results is that, after solving several puzzles, the subject had "warmed up" and was more likely to achieve faster times.

Following a day off, the outliers - the 10 puzzles with the fastest and slowest respective times - were resolved in random order to determine if these differences were based on variations in the grids themselves, or by the stochastic nature of the subject's solution process. This test would also help determine if the "exact" puzzles have more consistent times than permuted grids.

Q - 1'39" first time, 1'47" second time, +8 seconds difference
K - 2'37" first time, 2'55" second time, +18 seconds difference

O - 1'52" first time, 2'40" second time, +48 seconds difference
V - 3'7" first time, 2'14" second time, -53 seconds difference

R - 1'53" first time, 2'9" second time, +16 seconds difference
E - 3'41" first time, 2'40" second time, -61 seconds difference

W - 2'43" first time, 2'43" second time, 0 seconds difference
C - 3'56" first time, 2'23" second time, -93 seconds difference

T - 3'17" first time, 3'38" second time, +21 seconds difference
P - 4'21" first time, 2'45" second time, -96 seconds difference

In 8 of the cases, the second time on a puzzle moved towards the mean by being slower for the fastest puzzle or faster for the slowest puzzle. In one case the exact time reoccurred; in the last case, a slightly slower time was observed for the slowest time. In 3 of the 5 pairs of puzzles, the slower puzzle in the first trial was solved faster than its partner in the second iteration. In 2 cases, for C and P, the worst solution time actually led to a new fastest solving time. Collectively, these results suggest that the outliers were not perceived as easier/harder by the solver consistently but were rather the result of random variation in the subject's search process from puzzle to puzzle. Given the magnitude of the time differences in the retest, we also conclude that the variation between "exact" duplicates of a puzzle is close to identical to the variation from transforming the grid. The day off effectively eliminated memory of an individual puzzle so that it behaved as an isomorphic grid. The value of using isomorphic grids is that an identical puzzle can be tested almost immediately after another one whereas we do not expect back-to-back solutions of an exact duplicate would behave in the same way.

The 26 puzzles were then evaluated by Scanraid Sudoku to determine their "grades".

1 - HJKQZ - 74, 75, 73, 74, 75 - average 74.2, deviation .8
2 - BGOVX - 92, 102, 113, 108, 94 - average 101.8, deviation 9.0
3 - EFIRY - 154, 104, 104, 109, 150 - average 124.2, deviation 25.5
4 - ACNSW - 165, 132, 147, 156, 137 - average 147.4, deviation 13.5
5 - DMPTU - 139, 139, 147, 130, 143 - average 139.6, deviation 6.3
6 - L - 96

Scanraid's score agrees on the "easiest" puzzles by rating, although it differs on the order of 4 and 5. 6 (the order 5 Franciscan rectangle) takes about as long in its single solve as the average time of the hardest puzzle, but scores much less by Scanraid. Intriguingly, the rating of some of the isomorphic grids, particularly E and Y, compared to F, I, and R, was grossly different. The subject's solving of the puzzles E and Y was also much slower than those for FIR for this highest variance case, well beyond expectation, which drew our interest.

To account for the variation observed between the isomorphic puzzles E and F, we first checked our software to confirm the puzzles were isomorphic and indeed they were. We then considered the relative Scanraid solving path for E versus F as comparison.

E displays two instances of Singles in Row/Col, then two instances of Singles in Box, then one instance of Naked Pairs/Triples, then sixteen consecutive uses of show possibles (one choice cells) solves the rest.

F displays two instances of Singles in Row/Col, then one instance of Singles in Box, then one instance of Naked Pairs/Triples, then thirteen consecutive uses of show possibles solves the rest.

The first major difference in E and F occurs in the Singles in Box step, with an extra instance of this occurring for E but with the same total number of digits (2) being placed after this step concludes. In the specific case of E, a single (7) then allows another single (6) in the same box and this is done in two steps. For F, the isomorphic position in just one step uses a single (8) to then allow another single (9). Digit order therefore seems to be hard-coded into Scanraid's approach of step counting. Swapping 6 and 7 in E does not change the overall rating from 154, but does reduce the number of iterations of Singles in Box from two to one to be more F-like.

The next major difference in the two solution paths is in the Naked Pairs/Triples step. Unfortunately, hidden pairs and triples are more common than naked pairs and triples for most solvers (and are identified one at a time, not en masse), but Scanraid identifies a slew of naked pairs/triples in a single step. In E, 1 pair (eliminating 9 candidates) and 6 triples (eliminating 16 candidates) are identified. In F, 1 pair (eliminating 9 candidates) and 7 triples (eliminating 17 candidates) are identified. The extra triple (and extra missing candidate) change the solving profile slightly at the end with 3 fewer "one option only" steps needed. This faster solution path must account for the relative difference in score.

To further examine how different transforms affect the Scanraid rating, distinct transformation types were generated and rescored for the grids E and F. Varying just the number identity of either E or F led to only small changes in the score (across five grids E varied from 151 to 163 and F varied from 104 to 113). Varying the row/horizontal chute ordering had a much larger effect on the score (E varied from 105 to 154 over five grids; F varied from 104 to 151). This shows that geometry (perhaps in top/bottom or left/right order) is also a meaningful difference in the scoring of puzzles by scanraid. Taking a mirror (across the main upper-left to lower-right diagonal) of E turns a 154 rating into a 108 rating. This transformation is minor so the large change in rating seems astounding and speaks further to top/bottom or left/right bias. Similarly, just rotating E 90 degrees counter-clockwise turns a 154 rating into a 108 rating. Since the numbers are just symbols and an 8 can be seen as an infinity symbol if looked at sideways, this is not a meaningful transformation at all but it accounts for the entire magnitude of the difference in rating between E and F. The 90 degree rotation also changes the 1 pair and 6 triples for 25 eliminated candidates step of E into the 1 pair and 7 triples for 26 eliminated candidates step seen for F.


While Scanraid does offer some value in rating the fastest versus the slowest puzzles for this subject, we do not believe, within isomorphisms of a given puzzle, that the variation in Scanraid scores is meaningfully correlated with the experience of the tested subject. We posit that isomorphic grids should receive the same/similar scores from an ideal computer rater, but minor effects including digit order and top/bottom or left/right bias in this particular algorithm lead to large differences in rating (matching the untested hypothesis that a solver that goes in 1-9 order may find numbers at a different rate). An ideal rating system may require both a reordering of steps, as well as a consideration of solving path breadth. While Stuart has not divulged all the methodologies of his system [Stuart, 2007], his scoring has been evaluated using thousands of self-reported solving times in 5 or 10 minute wide ranges. As this study focuses on an "advanced" solver, these times would not be relevant for marking difficulty for this solver as the smallest reported time (< 5 minutes) accounts for all of the solutions here. It is an open question whether "beginner" or "intermediate" solvers are making use of the same techniques as an "advanced" solver, but with less efficiency and a constant scaling factor, or if their techniques are less developed and have their properties. A further consideration for this Scanraid scoring is that, if the self-reporting solvers are using parts of the Scanraid engine to solve the puzzles, which can present all candidates to a solver in its web-interface and printouts, the difficulty of naked singles (and pairs and triples) will be biased for these times relative to standard paper and pencil solving where solvers do not begin with candidate information and this information does not get updated during the solution process.

In contrast to Scanraid's approach, the given subject's style (see appendix A for a description of his heuristic) seems more tuned to individual occurrences of digits and geometry, and is therefore blind to most of these changes. The "deja vu" feel of certain puzzles at the start or end, as well as the identification of 4 puzzles out of 5 that felt close to the Franciscan Order 5 at some stage, suggests that many commonalities in solving path are found despite the highly transformed grids. Still, in any single trial, slight differences in searching pattern, particularly on the puzzles with the narrowest solving path, can lead to variations in times. On average, a standard deviation of about 15-20% in solve times seems likely on any single puzzle at these difficulties, although in the extreme a 2x time might be observed. At these difficulties, similar to those used by the Sudoku National Championship, all puzzles are being solved in less than 4 and a half minutes. Retesting with transformed grids should not lead to extremely high variations in round time.

For this finisher of three puzzles in 9-12 minutes, being retested and taking more than 16 minutes is not expected from this study. Consider if puzzles 3-5 constituted a round of the competition. The "best possible day" for the subject, taking his fastest times across all iterations, would allow for the round to be finished in 8 minutes, 3 seconds. The mean times for the subject would lead to completing the round in 9 minutes, 49 seconds. The "worst possible day" for the subject would leave a round time of 11 minutes, 58 seconds. The gap of "best" to "worst" day is less than a 50% increase of the fastest times. The gap of "best" to "mean" is less than 20%. Of course, making an error while solving a sudoku puzzle which requires full erasure of a grid would lead to much worse times than any of the instances observed here. Such errors occur all the time, but for identifying what a solver can do these "scrub" times should not be used.

In the Varshavsky case, having observed a single data point of finishing 3 specific puzzles in 14 minutes, we can now begin to consider what upper limit of solving time would be consistent with a legitimate competitor. Consider retesting the suspect on some number of identical (meaning isomorphic) puzzles from one to one trillion for the three round puzzles (~3 of each seems reasonable). These trials would establish both a "mean" and "best" time for a solver. Certainly, if the combined "mean" over three puzzles came within a minute of or even bested the observed time, this would remove suspicion from this solver. A "best" time that does not approach even a reasonable "mean" time, such as one 50% higher than the competition time or 20 minutes, would instead suggest cheating had occurred. Without actual proof of a method of cheating, organizers could allow the suspect to continue solving the 1 trillion possible puzzles until his "best" time could beat the 14 minute time, but by then his "mean" time and variance will be very well established allowing a less biased estimate of the probability that a legitimate solving approach was used during the competition.

In performing retesting with different solvers, it is likely that certain solving styles, particularly those with a high level of bifurcation/guessing, will have a particularly high variance in solving time. This study does not contain enough participants to report on such cases. Still, consider an extreme case in which a very difficult puzzle has an obvious bifurcation spot which will lead to a solution or not. The times for this puzzle will be bimodal, say 4 or 6 minutes per puzzle, tied to a coin-flip. A lucky run of three heads would lead to a 12 minute round while an unlucky performance would lead to an 18 minute round. Neither the "mean" time in this situation, nor the "best" time, would eventually come under scrutiny, particularly as it will be obvious by direct observation that the solver employs bifurcation, has high variance in his times, and may therefore just need a 1 in 8 chance of being lucky in each of three puzzles to qualify for the stage. Indeed, direct observation of the solver during retesting, to classify his solving heuristics in addition to assessing his speed, will clarify the results of invigilation.

All considered, whatever method Varshavsky used in Philadelphia to accomplish a 14 minute time in round 3 was not resistant to discovery after retesting, although without his failure to perform even adequately in the finals he would have escaped undetected at this championship under current protocol. Following retesting, where finishing even a single puzzle in 30 minutes was unlikely, it was not a hard decision to disqualify. Consider instead the more difficult case of a competitor who, on significant retesting, can only achieve a best time of 20 minutes, with ~6.5 minutes being the fastest he can solve any puzzle with an average of 8.5 minutes per and 1 minute standard deviation. One possible cheating strategy, relatively low-tech but still well-known in many testing situations, is to situate oneself near a more talented solver. Given the two-page booklet used in these championships, and the fact most solvers start on puzzle one and then just slide the booklet over to solve the second puzzle, imagine a cheater who solves puzzle two, then copies puzzle one from his neighbor to the right (taking about one minute which is a reasonable time for copying ~60 digits into a grid), and then solves puzzle three. This solver with a 6.5 minute "best" time per puzzle could then accomplish a 14 minute round by cheating, but under retesting could never come "close" in a probabilistic sense. A sufficiently stringent retesting procedure should prove the cheating even in this difficult case where a majority of the solving of a round was legitimate, but slower than other competitors. Ideally organizers would situate proctors and physical obstructions between competitors to reduce the opportunity for such an approach (or even a lesser variant of the approach where, after solving puzzle 3, a quick glance around after a "look at the clock" would allow for the copying of a handful of digits into both puzzles 1 and 2 to get a head-start on each).

In a world with many potential "Eugenes", with valuable prizes, and with increasingly miniaturized technologies that can assist cheating, it is reassuring to know that a relevant retesting approach may exist to identify cheaters if subjected to such testing. Particularly proactive organizers may make a mandatory "retest" a requirement for a podium finisher, similar to a drug test for a winning athlete. It is further reassuring to consider that basic steps such as handing out randomly permuted forms of the same puzzles to neighboring competitors could be used to remove/minimize the potential of the most common forms of cheating including prior knowledge of the test (there is no longer one test to memorize) and copying off a neighbor for sudoku. It is unlikely such steps are possible with almost any other kind of puzzle, including other logic puzzles or crossword puzzles.

While this study may seem like a rather extensive analysis for such a basic and fundamental result on puzzle-to-puzzle variation in solving times, it is the first study of its kind and should also serve as a good foundation for further studies that use isomorphic puzzles to explore the solving process. For example, differences in time for using "notes" versus "pencilmark-less" solving, differences arising from grid size both on paper and on whiteboards, and differences arising from the use of "reckless bifurcation" versus logical methods on puzzles of varying difficulties, should all be fruitful areas of further research. We will pursue these areas, and others, as our time and research budget allow.

Appendix A:

The subject defined his general heuristic in this way:

A. Randomly search, using chute-scanning, digit repetition, or geometry-based visual cues likely identified from memory due to pattern recognition, for the placement of either a number or an either/or notation for a single digit in a single region. Once a spot to write something is found, this is pushed onto a mental "stack" of information while the physical writing is occurring.
B. When there is a stack, consider the top-most position of the stack, searching for consequences in the row/column/box where the progress was made. Consequences include positive inferences about the digit on the stack, as well as negative inferences about other digits in the row/column/box. If a new spot to write a digit or note is found, this is pushed onto the stack while the writing is occurring.

Additional Considerations:
1. If, while in B mode, no consequences are found for the top element in the stack, that element is popped off the stack and solving continues in B or A mode depending on the presence of any other digits/notes in the stack.
2. When returning to A mode, some effort is made to not perform redundant searches given a cache of what has been searched in the past.
3. If in A for a sufficient length of time n (n = 30 to 180 seconds), enter a bifurcation at a spot with a high density of notes to return to B mode. The value of n depends on the expected difficulty of the puzzle, and is often correlated to the "trust" the subject has in the organizers of a competition.

While this heuristic is powerful, the random nature of the A mode - which can also be described as "what catches his eye first" - is likely to vary from iteration to iteration on a single puzzle as observed in the data. Most of the observed differences in time is likely from total A mode time, and not total B mode time, although it is hard to accurately tally either value for this subject.

Appendix B:

A - 000506017000901036000000000000200090059010800068000000000400020032000000076090400
B - 000800090000035706000021805092600070800009003047000000016200040400006001059000000
C - 000509000003801400060000002630000045570000089000000000004307200000105000010000003
D - 000803049000200010740000000000405061000600090150000000002030500003020100860000000
E - 000200800000500400000003007000470620907020030080000000000340910209060040010000000
F - 700030000080500000050100000018904000004006012000000600000000800061403000009005074
G - 000406509000501802000090030306080020090300100702000000070800400503000000408060070
H - 900410800010850009002000010006000050100980700070340006730000601690000304000000000
I - 400020061000000900760038000800060053000000800620093000040009000007100000030004000
J - 090700058700008096001050000320906000470302000000000000007040000030800065900005027
K - 800003059030700000009080036000000000601028000302041000050200000004030098200004061
L - 804093000072501000030800020080600090306024000027908000100080000000000104000000506
M - 000604000006000008805000063000902000040050300060070100000209000004000009503000027
N - 000000037070050096600800000020060013800200000000000048403605000000000000501903000
O - 001070004040902500000801000030000700504000061902000035020409800000607000003050006
P - 040070050000103000080090060400000300507000901000301000000408000800000200702000508
Q - 107000032000000000203000098040000600008405007500701080005104003090000800400208070
R - 600900000500200000010030000802601000003002140000000005407306000000000007006008420
S - 000000000608407000302601000003900000000000640020010750005200000000000560010090280
T - 000507803074000000000400006500010020800030060047000000000900002000302908026000000
U - 000007900000024108890000000002700040980000000001600030000052406000006500670000000
V - 000001050000720309000590804900070500043006080017000000100030200084009030026000000
W - 070608020600000500000703000000000000903000602204000803500000100000509000010304060
X - 000000045500007800070300026703095000602013000010800000100009300060400072000000098
Y - 000005800000600004000003500000097380001000000320010090490050070000069130008000000
Z - 000005060907080003603900800000008020108090006706400300000000000000820401000140907

Crook, J. F. "A Pencil-and-Paper Algorithm for Solving Sudoku Puzzles". Notices of the AMS (56, No. 4), 460-468 (2009). Link

Russel, E. and Jarvis, F. "Mathematics of Sudoku II", self-published article, (2006) Link
Also see "There are 5472730538 essentially different Sudoku grids ... and the Sudoku symmetry group" - web-published article (2005) - Link and other enumerations by Jarvis and coworkers.

Stuart, A. C. "Sudoku Creation and Grading" - self-published article, (2007) Link

Stuart, A. C. "The Logic of Sudoku" - Michael Mepham Publishing (2007). Link

Stuart, A. C. Scanraid Sudoku Solving Engine (2005-9) Link

Articles on the Varshavsky Case from motris.livejournal.com:

The Cat in the Hat Strikes Back? (an Open Letter to the Sudoku community and the organizers of the Sudoku National Championship about the potential cheating of Eugene Varshavsky during this Saturday's tournament) (10-25-2009) Link

Some Picture Evidence for the Sudobomber (10-26-1009) Link

The Sudobomber story develops... (10-27-2009) Link

Articles on the Varshavsky Case from the Philadelphia Inquirer:

"Sudoku prizes frozen in cheating probe" (10-28-2009) by John Timpane Link

"Sudoku contest competitor to be retested" (11-11-2009) by John Timpane Link

"3d-place winner disqualified in Sudoku Scandal" (11-24-2009) by John Timpane Link
(Anonymous) on November 26th, 2009 06:59 pm (UTC)
One trillion, you say...
Given a sudoku puzzle S, it is guaranteed that a single transformation of exchanging rows, rotating, or transposing will
yield a puzzle S' which is different from S (although S and S' will be isomorphic. It is not guaranteed that, given S and two transformed versions S' and S", that S' and S" will be distinct. The sudoku group of transformations needs to be better understood (as well as its action on the set of sudoku puzzles) before a claim that any isomorphism class has at least one trillion members in it can be made. An easy lower bound is 362880 through symbol replacement, and this can be improved upon, but not enough is known about the symmetries of an arbitrary puzzle.

Also, the claim that puzzle memory can be elimnated after a period of time as short as a day is suspect. The methodology for evaluating the ten outliers should have used a digit transposition on the ten puzzles to further challenge the effect memory might have on the solution times.

Of course, trailblazing research will point the way to better methods for evaluating the question under consideration, and point to new questions to be investigated. When this study is peer reviewed, may it inspire others to do more thorough and illuminating studies in similar vein.

Gerhard Paseman, 2009.11.27
motrismotris on November 26th, 2009 08:45 pm (UTC)
Re: One trillion, you say...
While it is true that the 9! * 6^8 * 2 potential transformations of a sudoku grid will not always lead to different results, the upper-limit of 1.218998E12 is rather close to the average value estimated by dividing the previously known total number and "essentially different" numbers given above as 1.218935E12. While the count for individual puzzles may vary, in almost all cases the count will be exactly 9! * 6^8 * 2 or 1.2 trillion if the research values reported by Russel and Jarvis are correct.

For this test subject, who has both a real job and limited sleep capabilities, short-term memory of a puzzle tested in this manner is unlikely given a roughly 40 hour period of time, including two periods of resting, between tests. A storage in intermediate or long-term memory would be required and the results do not speak to any such memory. Other solvers, such as those with photographic memories, may simply be able to remember a solution they have done before. This subject was not trying to remember any puzzle he was doing.

The subject will admit that, after solving over ten thousand sudoku, there are about 10 or 12 puzzles he really knows (such as onigame's Q) and that there likely is sufficient long-term memory of these puzzles that if one of them was presented to him, AND he could identify it as a likely Q', he might save some time on the essential steps. The challenge of identification for novelty puzzles stored in long-term memory may be worth exploring.

The authors of the study are willing to assist other testers to follow up on the conclusions made here, both by providing blinded sets of randomized puzzles for testing, and by solving blinded sets of puzzles sent by other investigators who want to further query the subject.
(Anonymous) on November 26th, 2009 07:03 pm (UTC)
Uh, and another thing...
Don't they teach at number placement university how to include a bibliography? tsk.

Gerhard Paseman, 2009.11.27
motrismotris on November 26th, 2009 07:14 pm (UTC)
Re: Uh, and another thing...
While I hate linking to the only published work I reference (Crook), I'll put together a biblio of links and post in the next half hour or so.
owens888owens888 on November 27th, 2009 08:17 am (UTC)
This is interesting work that would have been much strengthened by the addition of another test subject. In my field, you need to test at least 2 monkeys for a published paper, even if you record from hundreds of cells from each monkey and make each monkey do hundreds of trials. You have local access to at least one other monkey, I mean top sudoku-solver, and it is crucial that the results from a second subject are present to confirm the results from the first subject.

It is also possible that a few solvers have a very high variance in solving times. Your report does not quantify the rarity of this type of solver. Varshavsky may be this type of solver, and the retesting was not complete enough to say. I am still happy to get rid of his third-place position, though, both because circumstantial evidence suggests that he did in fact cheat and because it is questionable whether one should call a very high-variance person on a lucky day a "champion sudoku-solver."
motrismotris on November 27th, 2009 02:00 pm (UTC)
1) Very valid points. I would have thought more than 2 monkeys were needed. I tend not to give work to Wei-Hwa, but if he or anyone else is interested in solving a similar set of 25 or 26 puzzles for me, I can provide this for them. I don't believe a "proctor" is really needed so local access doesn't really matter here.

2) There is a style of very high variance solver that could exist. He would simply write random numbers in all the boxes, find if they work, and then erase them all to try again. Somewhere between every number being randomly entered and every number being deduced logically is where bifurcation at a solver's "fast" observation limit is actually what I consider for competitions the "best" of all possible strategies (relating to speed, not personal satisfaction). The fascinating aspect to me of Varshavsky's stage style and retest style is he was not recklessly guessing, to even allow the possibility he would fill a grid in 2-5 minutes to be out there. He was trying to solve them legitimately, but was solving them as if he'd done fewer than 5-10 in his lifetime. If the standard is even just filling in 81 cells, his "best" time even loses here.
(Anonymous) on November 27th, 2009 11:11 pm (UTC)
If you can provide them in regular format, I can set aside an evening to do them under controlled conditions, and report the results here. I expect my range will be between 150 seconds and 1500 seconds per puzzle. However, if I attempt the conversion, I will probably try to solve the puzzle during the conversion, which would skew my solving times.

I can also flip the pattern switch, i.e. I can attempt while doing the puzzle to compare it to the other puzzles. I promise to forget what has been said about which puzzles were isomorphic.

I can also attempt other observations during the process, if you tell me what you are interested in testing.

Gerhard Paseman, 2009.11.27
motrismotris on November 28th, 2009 12:17 am (UTC)
Thanks for the kind offer. You brought up in What If #1 the concept of trying to learn how to transform a grid to be "easier" or "harder". I didn't think this concept had merit as I didn't think transformation matters (for me), and so I decided to run this trial on myself to answer this for myself. Since you in part inspired this work (Eugene covered the rest), it would be nice to have your own data to confirm these results.

Indeed, this "study" (I use the term loosely still) would be well served with more testers, but this certainly means I should create some infrastructure to automate the grid strings going into a pdf to email out, for example. We should also have a discussion about what is a good source of puzzles to randomize. I hardly explore the whole range of difficulties out there - I chose a range that seemed appropriate for the Sudoku National Championship since I felt a statistical analysis of the challenge of "retesting" was useful, regardless of what happened with Eugene's retest - but depending on the goals a different set of sudoku to cover a wider range of potential difficulties, or sudoku targeted to have a particular kind of sticking point, might also be interesting. There are also some questions you've brought up about experimental design and this discussion should go on before I try to recruit more volunteers. For example, should I send out puzzles in two waves of say 20 each time. Should I have 5-10 puzzles exactly duplicated in the second set, but have that solved a week later without informing the solver they are getting a duplicate? Should there be m trials of n puzzles (m=n), or is a higher m or higher n a better starting point. If the specific hypothesis to test is "does an exact duplicate without memory solve the same as an isomorphic grid", a different set-up may work better.

I have my opinions on what would be best, but I welcome advice on changes in approach before anyone actually tries to make this "study" into something truly academic.
nathan_0 on December 18th, 2009 07:09 pm (UTC)
Great post
Really enjoyed this one, Thomas.

I knocked out the 26 puzzles, here are my average times:

HJKQZ 4:09 (3:32 to 5:46)
BGOVX 4:50 (3:37 to 7:12)
EFIRY 5:56 (4:33 to 7:46)
ACNSW 6:58 (3:53 to 10:39)
DMPTU 6:11 (4:30 to 8:16)
L 6:09 (6:09)

Not only slower than yours, but significantly looser w/r/t the ranges. I'll probably redo all 26 of them on Monday.

motrismotris on December 18th, 2009 07:36 pm (UTC)
Re: Great post
If you do, it would be cool to get all the times to be able to compare, particularly to see if your outliers like the 10:39 time relative to the 3:53 time on the one series are more similar or dissimilar the next time around.

The next time I run this experiment (most likely with pencil-mark less solving versus notes), I'll have an Adobe Illustrator script to make pdfs of all the puzzles so testing by others should be easier.
(Anonymous) on May 6th, 2010 07:49 pm (UTC)
very interesting. love sudoku, great way to develop concentration and focus. very addictive