About four weeks ago, a former sudoku champion rang a bell that cannot be unrung by claiming a fellow finalist was a cheater. While a lot of circumstantial evidence supported this case, no smoking gun was likely to be found from looking at video and photography from the event itself. Sudoku-related curiosities such as "unreasonable sticking points" or "missing pencilmarks" were unlikely to satisfy all parties necessary to disqualify the competitor. However, the tournament organizers came up with a process that could clarify matters: the contestant would be offered a second chance to solve more sudoku puzzles under closer watch to demonstrate his proficiency and to deflect the allegations of cheating. This leads to this week's topic:

What if you had to conduct a retest for sudoku? What kind of standardized puzzles or times would suffice to conclusively prove either the ability or inability to finish three puzzles in 14 minutes? How many puzzles would be needed and what would these be? Can sudoku be made "retestable" to indeed prove Eugene is detestable?

These questions necessarily lead to a discussion of mean and variance and nuances of the solving process; a fitting answer requires a careful analysis of how an actual human solver performs from puzzle to puzzle in a controlled setting. Fortunately, I know an eccentric scientist with an advanced degree in "Number Placement" who has a relevant manuscript to share with you:

We generated five isomorphic copies of five sudoku puzzles using digit reassignment, row/column/chute swapping, and rotation/reflection. A sixth puzzle with an unusual solving property was also transformed and added to the set for twenty-six test puzzles. Our subject, an advanced division contestant at the recent Sudoku National Championship, was able to identify the grid with the unusual solving property from the set of 26 puzzles, and also identified several commonalities within the remaining isomorphic puzzles. Considering his solution times, a 15-20% standard deviation in time was observed across the isomorphic versions of each puzzle. The fastest/slowest grid outliers were then resolved without further transformation. The subject's results suggest that the likely mean and standard deviation in solving time on an exact duplicate is very close to that of a transformed grid.

We also examined how transformations would affect computer ratings, specifically those of Scanraid Sudoku. Both number ordering and top-to-bottom, left-to-right biases were observed in the grading of isomorphic puzzles. In one extreme case, a 90 degree rotation of a puzzle led to a 50% increase in the rating for the puzzle. While the varied ratings may apply to the most esoteric human solvers, who persist with perfect top to bottom or 1 to 9 scanning, these ratings in general seem inconsistent with the expected human solving experience of sudoku.

Sudoku is a logic-based constraint-satisfaction puzzle, most typically presented on a 9x9 grid of squares that must be filled with numbers from 1-9 such that no number repeats in any row, column, or 3x3 bold region; other sizes and the use of symbols other than numbers in the cells are possible. The regions add a third constraint onto each cell which separates sudoku from Latin Square-based puzzles with solely row and column constraints. Despite some reports to the contrary [Crook, 2009], sudoku puzzles must have a single solution to be considered valid; valid sudoku with as few as 17 given numbers are known, although this has not yet been proven to be the lower bound.

Following years of research and individual experiences with making deductions in sudoku, a number of strategies have been formalized for their solution [see, for example, Scanraid Sudoku solver; Stuart (2005-9), also Stuart (2007)]. Many of the basic strategies fit into the categories of either positive inference or negative inference depending on whether the presence, or absence, of particular numbers is used in the placement of further digits. Different solvers may vary in their preferences for or abilities with positive versus negative inference which presents a challenge for the "retesting" of a competitor, as in the recent Varshavsky case [Philadelphia Inquirer, 2009], as a high variance in solving times might be encountered by a particular solver if presented a random sudoku to solve in under 5 minutes. Additionally, the breadth of a solution path can also influence the solving time, with a smaller variance of times expected for a puzzle with many places to identify digits by logical means compared to those with fewer such places. Given the puzzle by puzzle variation seen in sudoku, single puzzle finals (and playoffs in general) are not viewed by many members of the advanced solving community as an appropriate way to select the best overall sudoku solver [see motris.livejournal.com].

An individual sudoku puzzle can be transformed into a large number of isomorphic grids using number reordering, row/column/chute swapping, and rotations/reflections. For this reason, while 6,670,903,752,021,072,936,960 possible solutions have been reported for sudoku, only 5,472,730,538 are considered "essentially different". [Russel, Jarvis, 2005 and 2006] It is speculated, but has not yet been demonstrated, that isomorphic puzzles can lead to highly variable solution times due to various esoteric tendencies of the solver. A solver who approaches sudoku in a cyclic 1-9 search order may be affected by reassigning the values of the givens. Similarly, a solver who starts from the top to bottom or from the left to the right could be affected by row/column/chute swapping as well as rotations or diagonal mirroring. However, despite these nuances of the solution process, the essential steps along the solution path will be unchanged in these isomorphic grids. If one had to retest a competitor, isomorphic forms of the puzzle might be valid for such a test. If the competitor had, for example, memorized the solution of a particular puzzle (even in an extreme case where he has memorized an order of digits to enter, perhaps from receiving a copy of the puzzle before the test and needing to reproduce a "solution path" while under increased scrutiny), an isomorphic version of the same puzzle will be protected from such an attack as it will be impossible to memorize the approximately one trillion isomorphic grids that are possible for any individual puzzle. Further, if a solver claims he is better at, say, negative inference - excelling at naked singles but not hidden singles for example - and tries to use this to explain a much slower time for a particular puzzle, the use of an isomorphic grid containing an identical number of potential starting points for a particular kind of logic rule will negate any "variation of solving method" rebuttal for a poor retest. (Of course, according to Thomas Snyder [personal communication], the most fascinating element of the Varshavsky case was that after 4 weeks of being under suspicion, the competitor actually showed up for a retest without any practice to make it remotely possible he could deceive the invigilators, even on identical puzzles to the ones under suspicion! We are therefore considering the case of catching smarter cheaters, with an altered ratio of intelligence to hubris, in this study).

Here, we have tested a set of puzzles randomly ordered following transformations to determine how consistent or inconsistent the solving process is. We also added a single puzzle with a particularly unusual solving property into the mix - and alerted the subject to its presence - to see if the step would be encountered on his solving path and identifiable amidst many other puzzles.

A perl script was written to output a specified number of isomorphic forms of an input classic 9x9 sudoku puzzle by performing number replacement, then row and horizontal chute permutations, then column and vertical chute permutations, and then possibly performing a rotation/reflection to reach the final altered state. 5 puzzles (3 of approximately 4 star difficulty, 2 of approximately 5 star difficulty, all appropriate to reward logical methods instead of bifurcation using the identification of pairs and triples as the hardest steps by Scanraid analysis) were selected for analysis and five versions of each were generated. A sixth puzzle with an unusual solving property, specifically an Order 5 Franciscan Rectangle that fits into a class of "uniqueness" techniques, was chosen and randomized as well. This advanced technique is not required to solve the puzzle, but is a potential shortcut an advanced solver might find and exploit. All puzzles had between 24 and 28 givens and the number of givens was not unique for any of the six puzzles.

The 26 puzzles (now labeled A-Z) were printed on 4.5" by 4.5" square grids, identical to the size used at the Sudoku National Championship, and then solved in batches of 5 (in one case 6) puzzles at a time with short 5 minute breaks. A recorded, pre-watched, episode of "Sportscenter" was playing in the background throughout the testing as a kind of white noise. The subject, selected for being a high-finishing participant in many sudoku championships, chose to solve the puzzles while seated in a comfy chair, writing against a notebook surface with .9 mm Pentel Twist-Erase III pencils. The subject timed himself on the solution of these puzzles, and the finished grids were checked for completeness of filling, but not checked for accuracy. From anecdotal evidence provided by the subject, about 95% of puzzles will be correctly solved in this manner so, in this study, approximately 1 puzzle may have an error such as a transposition.

In addition to the human testing, the puzzles were also scored in Scanraid Sudoku to determine how "computer rating" itself varies as a result of grid transformations.

The 26 puzzles are provided in Appendix B.

The subject's initial solve data (and paper notes) follow:

A - 3'22", B - 2'39", C - 3'56", D - 3'57" (franciscan-like), E - 3'41", F - 2'40", G - 3'3", H - 1'48", I - 2'12", J - 2'2", K - 2'27", L - 3'31" (Order 5 Franciscan), M - 3'18", N - 3'1", O - 1'52", P - 4'21" (franciscan-like), Q - 1'39", R - 1'53", S - 3'16", T - 3'17" (franciscan-like), U - 3'41" (franciscan-like), V - 3'7", W - 2'53", X - 2'6", Y - 3'35", Z - 2'1"

The subject completed the 26 puzzles in a range of times from as short as 99 seconds to as long as 261 seconds. The average solving time was approximately 2 minutes, 54 seconds, with a standard deviation of 46 seconds. The median solve time was approximately 3 minutes, 2 seconds. During solving, the subject never had to make an erasure or correction to his puzzles, which would lead to a higher than expected time for his solving process on that individual grid. If such an event had occurred, we would have suggested a future retesting of the same puzzle as "clean" solving times are of the most interest to us for comparison.

The subject also made notes on some of his puzzles. Given the expectation of a Franciscan puzzle of order 5, the subject related that "when I saw 4 similar pairs that could make such a situation, but didn't resolve properly, I noted 'franciscan-like' at the end of the solution process." For 4 solutions (D, P, T, U) this text was indeed written in the margins of that test paper. For a fifth puzzle, L, Order 5 Franciscan was written. Puzzle L is indeed the sudoku with an Order 5 Franciscan Rectangle. D, P, T, U (and M) are all isomorphic forms of another puzzle. It is unclear why the subject did not mark M as Franciscan-like but it is within a second of the fastest time over the trials for this puzzle meaning the critical steps may have been identified before the relevant pair property was noted in a period of being "stuck". It also occurred in the trial immediately after the Order 5 Franciscan grid (L) and could have been missed due to increased laxity after identifying the hidden sixth puzzle.

After the testing, the subject also noted to experimenters that "deja vu" feelings about the starting steps of puzzles would happen such as "I've chased this kind of digit by pointing pairs around the grid before." While we did not ask the subject to try to isolate each individual puzzle, the solver indicated that if he noted the kinds of starting steps, as well as sticking points, he could possibly have accomplished this assignment with some accuracy. Because of this alleged "memory" of a puzzle, we did consider whether the solver got "better" at a particular puzzle, even with isomorphism, as the trials went on.

Rearranging the 26 puzzles into the 6 unique types, and ordering by apparent difficulty, these are the subject's times in chronological order:

1 - HJKQZ - 1'48", 2'2", 2'27", 1'39", 2'1" - average time 1'59", deviation 18", range 1'39"-2'27"

2 - BGOVX - 2'39", 3'3", 1'52", 3'7", 2'6" - average time 2'33", deviation 33", range 1'52"-3'3"

3 - EFIRY - 3'41", 2'40", 2'12", 1'53", 3'35" - average time 2'48", deviation 48", range 1'53"-3'41"

4 - ACNSW - 3'22", 3'56", 3'1", 3'16", 2'53" - average time 3'18", deviation 24", range 2'53"-3'56"

5 - DMPTU - 3'57", 3'18", 4'21", 3'17", 3'41" - average time 3'43", deviation 27", range 3'17"-4'21"

6 - L - 3'41"

The tightest solution times exist for the puzzle with the fastest solution, but the next two tightest distributions are for the two puzzles that respectively take the longest. When sorted by average solve time, both the fastest time on a puzzle, and the slowest time on a puzzle (the "range") are also in the indicated 1-5 order. Considering the relative position of the fastest versus the slowest solve, for these three puzzles the fastest solutions were #4, 3, 4, 5, and 4 (average of 4.0) and the slowest were #3, 4, 1, 2, 3 (average of 2.6) which suggests a slight possibility of some learning throughout the process although in a few cases, as on the puzzle with the highest variance (3), a very fast fourth solution was followed by a much slower fifth solution (by 102 seconds) that was only slight ahead of the overall slowest solution so a matter of a few seconds would change a 1 to a 5 and affect this averaging. Another hypothesis that accounts for the results is that, after solving several puzzles, the subject had "warmed up" and was more likely to achieve faster times.

Following a day off, the outliers - the 10 puzzles with the fastest and slowest respective times - were resolved in random order to determine if these differences were based on variations in the grids themselves, or by the stochastic nature of the subject's solution process. This test would also help determine if the "exact" puzzles have more consistent times than permuted grids.

Q - 1'39" first time, 1'47" second time, +8 seconds difference

K - 2'37" first time, 2'55" second time, +18 seconds difference

O - 1'52" first time, 2'40" second time, +48 seconds difference

V - 3'7" first time, 2'14" second time, -53 seconds difference

R - 1'53" first time, 2'9" second time, +16 seconds difference

E - 3'41" first time, 2'40" second time, -61 seconds difference

W - 2'43" first time, 2'43" second time, 0 seconds difference

C - 3'56" first time, 2'23" second time, -93 seconds difference

T - 3'17" first time, 3'38" second time, +21 seconds difference

P - 4'21" first time, 2'45" second time, -96 seconds difference

In 8 of the cases, the second time on a puzzle moved towards the mean by being slower for the fastest puzzle or faster for the slowest puzzle. In one case the exact time reoccurred; in the last case, a slightly slower time was observed for the slowest time. In 3 of the 5 pairs of puzzles, the slower puzzle in the first trial was solved faster than its partner in the second iteration. In 2 cases, for C and P, the worst solution time actually led to a new fastest solving time. Collectively, these results suggest that the outliers were not perceived as easier/harder by the solver consistently but were rather the result of random variation in the subject's search process from puzzle to puzzle. Given the magnitude of the time differences in the retest, we also conclude that the variation between "exact" duplicates of a puzzle is close to identical to the variation from transforming the grid. The day off effectively eliminated memory of an individual puzzle so that it behaved as an isomorphic grid. The value of using isomorphic grids is that an identical puzzle can be tested almost immediately after another one whereas we do not expect back-to-back solutions of an exact duplicate would behave in the same way.

The 26 puzzles were then evaluated by Scanraid Sudoku to determine their "grades".

1 - HJKQZ - 74, 75, 73, 74, 75 - average 74.2, deviation .8

2 - BGOVX - 92, 102, 113, 108, 94 - average 101.8, deviation 9.0

3 - EFIRY - 154, 104, 104, 109, 150 - average 124.2, deviation 25.5

4 - ACNSW - 165, 132, 147, 156, 137 - average 147.4, deviation 13.5

5 - DMPTU - 139, 139, 147, 130, 143 - average 139.6, deviation 6.3

6 - L - 96

Scanraid's score agrees on the "easiest" puzzles by rating, although it differs on the order of 4 and 5. 6 (the order 5 Franciscan rectangle) takes about as long in its single solve as the average time of the hardest puzzle, but scores much less by Scanraid. Intriguingly, the rating of some of the isomorphic grids, particularly E and Y, compared to F, I, and R, was grossly different. The subject's solving of the puzzles E and Y was also much slower than those for FIR for this highest variance case, well beyond expectation, which drew our interest.

To account for the variation observed between the isomorphic puzzles E and F, we first checked our software to confirm the puzzles were isomorphic and indeed they were. We then considered the relative Scanraid solving path for E versus F as comparison.

E displays two instances of Singles in Row/Col, then two instances of Singles in Box, then one instance of Naked Pairs/Triples, then sixteen consecutive uses of show possibles (one choice cells) solves the rest.

F displays two instances of Singles in Row/Col, then one instance of Singles in Box, then one instance of Naked Pairs/Triples, then thirteen consecutive uses of show possibles solves the rest.

The first major difference in E and F occurs in the Singles in Box step, with an extra instance of this occurring for E but with the same total number of digits (2) being placed after this step concludes. In the specific case of E, a single (7) then allows another single (6) in the same box and this is done in two steps. For F, the isomorphic position in just one step uses a single (8) to then allow another single (9). Digit order therefore seems to be hard-coded into Scanraid's approach of step counting. Swapping 6 and 7 in E does not change the overall rating from 154, but does reduce the number of iterations of Singles in Box from two to one to be more F-like.

The next major difference in the two solution paths is in the Naked Pairs/Triples step. Unfortunately, hidden pairs and triples are more common than naked pairs and triples for most solvers (and are identified one at a time, not en masse), but Scanraid identifies a slew of naked pairs/triples in a single step. In E, 1 pair (eliminating 9 candidates) and 6 triples (eliminating 16 candidates) are identified. In F, 1 pair (eliminating 9 candidates) and 7 triples (eliminating 17 candidates) are identified. The extra triple (and extra missing candidate) change the solving profile slightly at the end with 3 fewer "one option only" steps needed. This faster solution path must account for the relative difference in score.

To further examine how different transforms affect the Scanraid rating, distinct transformation types were generated and rescored for the grids E and F. Varying just the number identity of either E or F led to only small changes in the score (across five grids E varied from 151 to 163 and F varied from 104 to 113). Varying the row/horizontal chute ordering had a much larger effect on the score (E varied from 105 to 154 over five grids; F varied from 104 to 151). This shows that geometry (perhaps in top/bottom or left/right order) is also a meaningful difference in the scoring of puzzles by scanraid. Taking a mirror (across the main upper-left to lower-right diagonal) of E turns a 154 rating into a 108 rating. This transformation is minor so the large change in rating seems astounding and speaks further to top/bottom or left/right bias. Similarly, just rotating E 90 degrees counter-clockwise turns a 154 rating into a 108 rating. Since the numbers are just symbols and an 8 can be seen as an infinity symbol if looked at sideways, this is not a meaningful transformation at all but it accounts for the entire magnitude of the difference in rating between E and F. The 90 degree rotation also changes the 1 pair and 6 triples for 25 eliminated candidates step of E into the 1 pair and 7 triples for 26 eliminated candidates step seen for F.

While Scanraid does offer some value in rating the fastest versus the slowest puzzles for this subject, we do not believe, within isomorphisms of a given puzzle, that the variation in Scanraid scores is meaningfully correlated with the experience of the tested subject. We posit that isomorphic grids should receive the same/similar scores from an ideal computer rater, but minor effects including digit order and top/bottom or left/right bias in this particular algorithm lead to large differences in rating (matching the untested hypothesis that a solver that goes in 1-9 order may find numbers at a different rate). An ideal rating system may require both a reordering of steps, as well as a consideration of solving path breadth. While Stuart has not divulged all the methodologies of his system [Stuart, 2007], his scoring has been evaluated using thousands of self-reported solving times in 5 or 10 minute wide ranges. As this study focuses on an "advanced" solver, these times would not be relevant for marking difficulty for this solver as the smallest reported time (< 5 minutes) accounts for all of the solutions here. It is an open question whether "beginner" or "intermediate" solvers are making use of the same techniques as an "advanced" solver, but with less efficiency and a constant scaling factor, or if their techniques are less developed and have their properties. A further consideration for this Scanraid scoring is that, if the self-reporting solvers are using parts of the Scanraid engine to solve the puzzles, which can present all candidates to a solver in its web-interface and printouts, the difficulty of naked singles (and pairs and triples) will be biased for these times relative to standard paper and pencil solving where solvers do not begin with candidate information and this information does not get updated during the solution process.

In contrast to Scanraid's approach, the given subject's style (see appendix A for a description of his heuristic) seems more tuned to individual occurrences of digits and geometry, and is therefore blind to most of these changes. The "deja vu" feel of certain puzzles at the start or end, as well as the identification of 4 puzzles out of 5 that felt close to the Franciscan Order 5 at some stage, suggests that many commonalities in solving path are found despite the highly transformed grids. Still, in any single trial, slight differences in searching pattern, particularly on the puzzles with the narrowest solving path, can lead to variations in times. On average, a standard deviation of about 15-20% in solve times seems likely on any single puzzle at these difficulties, although in the extreme a 2x time might be observed. At these difficulties, similar to those used by the Sudoku National Championship, all puzzles are being solved in less than 4 and a half minutes. Retesting with transformed grids should not lead to extremely high variations in round time.

For this finisher of three puzzles in 9-12 minutes, being retested and taking more than 16 minutes is not expected from this study. Consider if puzzles 3-5 constituted a round of the competition. The "best possible day" for the subject, taking his fastest times across all iterations, would allow for the round to be finished in 8 minutes, 3 seconds. The mean times for the subject would lead to completing the round in 9 minutes, 49 seconds. The "worst possible day" for the subject would leave a round time of 11 minutes, 58 seconds. The gap of "best" to "worst" day is less than a 50% increase of the fastest times. The gap of "best" to "mean" is less than 20%. Of course, making an error while solving a sudoku puzzle which requires full erasure of a grid would lead to much worse times than any of the instances observed here. Such errors occur all the time, but for identifying what a solver can do these "scrub" times should not be used.

In the Varshavsky case, having observed a single data point of finishing 3 specific puzzles in 14 minutes, we can now begin to consider what upper limit of solving time would be consistent with a legitimate competitor. Consider retesting the suspect on some number of identical (meaning isomorphic) puzzles from one to one trillion for the three round puzzles (~3 of each seems reasonable). These trials would establish both a "mean" and "best" time for a solver. Certainly, if the combined "mean" over three puzzles came within a minute of or even bested the observed time, this would remove suspicion from this solver. A "best" time that does not approach even a reasonable "mean" time, such as one 50% higher than the competition time or 20 minutes, would instead suggest cheating had occurred. Without actual proof of a method of cheating, organizers could allow the suspect to continue solving the 1 trillion possible puzzles until his "best" time could beat the 14 minute time, but by then his "mean" time and variance will be very well established allowing a less biased estimate of the probability that a legitimate solving approach was used during the competition.

In performing retesting with different solvers, it is likely that certain solving styles, particularly those with a high level of bifurcation/guessing, will have a particularly high variance in solving time. This study does not contain enough participants to report on such cases. Still, consider an extreme case in which a very difficult puzzle has an obvious bifurcation spot which will lead to a solution or not. The times for this puzzle will be bimodal, say 4 or 6 minutes per puzzle, tied to a coin-flip. A lucky run of three heads would lead to a 12 minute round while an unlucky performance would lead to an 18 minute round. Neither the "mean" time in this situation, nor the "best" time, would eventually come under scrutiny, particularly as it will be obvious by direct observation that the solver employs bifurcation, has high variance in his times, and may therefore just need a 1 in 8 chance of being lucky in each of three puzzles to qualify for the stage. Indeed, direct observation of the solver during retesting, to classify his solving heuristics in addition to assessing his speed, will clarify the results of invigilation.

All considered, whatever method Varshavsky used in Philadelphia to accomplish a 14 minute time in round 3 was not resistant to discovery after retesting, although without his failure to perform even adequately in the finals he would have escaped undetected at this championship under current protocol. Following retesting, where finishing even a single puzzle in 30 minutes was unlikely, it was not a hard decision to disqualify. Consider instead the more difficult case of a competitor who, on significant retesting, can only achieve a best time of 20 minutes, with ~6.5 minutes being the fastest he can solve any puzzle with an average of 8.5 minutes per and 1 minute standard deviation. One possible cheating strategy, relatively low-tech but still well-known in many testing situations, is to situate oneself near a more talented solver. Given the two-page booklet used in these championships, and the fact most solvers start on puzzle one and then just slide the booklet over to solve the second puzzle, imagine a cheater who solves puzzle two, then copies puzzle one from his neighbor to the right (taking about one minute which is a reasonable time for copying ~60 digits into a grid), and then solves puzzle three. This solver with a 6.5 minute "best" time per puzzle could then accomplish a 14 minute round by cheating, but under retesting could never come "close" in a probabilistic sense. A sufficiently stringent retesting procedure should prove the cheating even in this difficult case where a majority of the solving of a round was legitimate, but slower than other competitors. Ideally organizers would situate proctors and physical obstructions between competitors to reduce the opportunity for such an approach (or even a lesser variant of the approach where, after solving puzzle 3, a quick glance around after a "look at the clock" would allow for the copying of a handful of digits into both puzzles 1 and 2 to get a head-start on each).

In a world with many potential "Eugenes", with valuable prizes, and with increasingly miniaturized technologies that can assist cheating, it is reassuring to know that a relevant retesting approach may exist to identify cheaters if subjected to such testing. Particularly proactive organizers may make a mandatory "retest" a requirement for a podium finisher, similar to a drug test for a winning athlete. It is further reassuring to consider that basic steps such as handing out randomly permuted forms of the same puzzles to neighboring competitors could be used to remove/minimize the potential of the most common forms of cheating including prior knowledge of the test (there is no longer one test to memorize) and copying off a neighbor for sudoku. It is unlikely such steps are possible with almost any other kind of puzzle, including other logic puzzles or crossword puzzles.

While this study may seem like a rather extensive analysis for such a basic and fundamental result on puzzle-to-puzzle variation in solving times, it is the first study of its kind and should also serve as a good foundation for further studies that use isomorphic puzzles to explore the solving process. For example, differences in time for using "notes" versus "pencilmark-less" solving, differences arising from grid size both on paper and on whiteboards, and differences arising from the use of "reckless bifurcation" versus logical methods on puzzles of varying difficulties, should all be fruitful areas of further research. We will pursue these areas, and others, as our time and research budget allow.

The subject defined his general heuristic in this way:

Modes:

A. Randomly search, using chute-scanning, digit repetition, or geometry-based visual cues likely identified from memory due to pattern recognition, for the placement of either a number or an either/or notation for a single digit in a single region. Once a spot to write something is found, this is pushed onto a mental "stack" of information while the physical writing is occurring.

B. When there is a stack, consider the top-most position of the stack, searching for consequences in the row/column/box where the progress was made. Consequences include positive inferences about the digit on the stack, as well as negative inferences about other digits in the row/column/box. If a new spot to write a digit or note is found, this is pushed onto the stack while the writing is occurring.

Additional Considerations:

1. If, while in B mode, no consequences are found for the top element in the stack, that element is popped off the stack and solving continues in B or A mode depending on the presence of any other digits/notes in the stack.

2. When returning to A mode, some effort is made to not perform redundant searches given a cache of what has been searched in the past.

3. If in A for a sufficient length of time n (n = 30 to 180 seconds), enter a bifurcation at a spot with a high density of notes to return to B mode. The value of n depends on the expected difficulty of the puzzle, and is often correlated to the "trust" the subject has in the organizers of a competition.

While this heuristic is powerful, the random nature of the A mode - which can also be described as "what catches his eye first" - is likely to vary from iteration to iteration on a single puzzle as observed in the data. Most of the observed differences in time is likely from total A mode time, and not total B mode time, although it is hard to accurately tally either value for this subject.

Puzzles:

A - 000506017000901036000000000000200090059010800068000000000400020032000000076090400

B - 000800090000035706000021805092600070800009003047000000016200040400006001059000000

C - 000509000003801400060000002630000045570000089000000000004307200000105000010000003

D - 000803049000200010740000000000405061000600090150000000002030500003020100860000000

E - 000200800000500400000003007000470620907020030080000000000340910209060040010000000

F - 700030000080500000050100000018904000004006012000000600000000800061403000009005074

G - 000406509000501802000090030306080020090300100702000000070800400503000000408060070

H - 900410800010850009002000010006000050100980700070340006730000601690000304000000000

I - 400020061000000900760038000800060053000000800620093000040009000007100000030004000

J - 090700058700008096001050000320906000470302000000000000007040000030800065900005027

K - 800003059030700000009080036000000000601028000302041000050200000004030098200004061

L - 804093000072501000030800020080600090306024000027908000100080000000000104000000506

M - 000604000006000008805000063000902000040050300060070100000209000004000009503000027

N - 000000037070050096600800000020060013800200000000000048403605000000000000501903000

O - 001070004040902500000801000030000700504000061902000035020409800000607000003050006

P - 040070050000103000080090060400000300507000901000301000000408000800000200702000508

Q - 107000032000000000203000098040000600008405007500701080005104003090000800400208070

R - 600900000500200000010030000802601000003002140000000005407306000000000007006008420

S - 000000000608407000302601000003900000000000640020010750005200000000000560010090280

T - 000507803074000000000400006500010020800030060047000000000900002000302908026000000

U - 000007900000024108890000000002700040980000000001600030000052406000006500670000000

V - 000001050000720309000590804900070500043006080017000000100030200084009030026000000

W - 070608020600000500000703000000000000903000602204000803500000100000509000010304060

X - 000000045500007800070300026703095000602013000010800000100009300060400072000000098

Y - 000005800000600004000003500000097380001000000320010090490050070000069130008000000

Z - 000005060907080003603900800000008020108090006706400300000000000000820401000140907

Crook, J. F. "A Pencil-and-Paper Algorithm for Solving Sudoku Puzzles". Notices of the AMS (56, No. 4), 460-468 (2009). Link

Russel, E. and Jarvis, F. "Mathematics of Sudoku II", self-published article, (2006) Link

Also see "There are 5472730538 essentially different Sudoku grids ... and the Sudoku symmetry group" - web-published article (2005) - Link and other enumerations by Jarvis and coworkers.

Stuart, A. C. "Sudoku Creation and Grading" - self-published article, (2007) Link

Stuart, A. C. "The Logic of Sudoku" - Michael Mepham Publishing (2007). Link

Stuart, A. C. Scanraid Sudoku Solving Engine (2005-9) Link

The Cat in the Hat Strikes Back? (an Open Letter to the Sudoku community and the organizers of the Sudoku National Championship about the potential cheating of Eugene Varshavsky during this Saturday's tournament) (10-25-2009) Link

Some Picture Evidence for the Sudobomber (10-26-1009) Link

The Sudobomber story develops... (10-27-2009) Link

"Sudoku prizes frozen in cheating probe" (10-28-2009) by John Timpane Link

"Sudoku contest competitor to be retested" (11-11-2009) by John Timpane Link

"3d-place winner disqualified in Sudoku Scandal" (11-24-2009) by John Timpane Link

What if you had to conduct a retest for sudoku? What kind of standardized puzzles or times would suffice to conclusively prove either the ability or inability to finish three puzzles in 14 minutes? How many puzzles would be needed and what would these be? Can sudoku be made "retestable" to indeed prove Eugene is detestable?

These questions necessarily lead to a discussion of mean and variance and nuances of the solving process; a fitting answer requires a careful analysis of how an actual human solver performs from puzzle to puzzle in a controlled setting. Fortunately, I know an eccentric scientist with an advanced degree in "Number Placement" who has a relevant manuscript to share with you:

**Effects of Transformations of Sudoku Puzzles on Solution Times of an Advanced Solver**

A Case Study by Dr. Sudoku

Abstract:A Case Study by Dr. Sudoku

Abstract:

We generated five isomorphic copies of five sudoku puzzles using digit reassignment, row/column/chute swapping, and rotation/reflection. A sixth puzzle with an unusual solving property was also transformed and added to the set for twenty-six test puzzles. Our subject, an advanced division contestant at the recent Sudoku National Championship, was able to identify the grid with the unusual solving property from the set of 26 puzzles, and also identified several commonalities within the remaining isomorphic puzzles. Considering his solution times, a 15-20% standard deviation in time was observed across the isomorphic versions of each puzzle. The fastest/slowest grid outliers were then resolved without further transformation. The subject's results suggest that the likely mean and standard deviation in solving time on an exact duplicate is very close to that of a transformed grid.

We also examined how transformations would affect computer ratings, specifically those of Scanraid Sudoku. Both number ordering and top-to-bottom, left-to-right biases were observed in the grading of isomorphic puzzles. In one extreme case, a 90 degree rotation of a puzzle led to a 50% increase in the rating for the puzzle. While the varied ratings may apply to the most esoteric human solvers, who persist with perfect top to bottom or 1 to 9 scanning, these ratings in general seem inconsistent with the expected human solving experience of sudoku.

**Introduction:**Sudoku is a logic-based constraint-satisfaction puzzle, most typically presented on a 9x9 grid of squares that must be filled with numbers from 1-9 such that no number repeats in any row, column, or 3x3 bold region; other sizes and the use of symbols other than numbers in the cells are possible. The regions add a third constraint onto each cell which separates sudoku from Latin Square-based puzzles with solely row and column constraints. Despite some reports to the contrary [Crook, 2009], sudoku puzzles must have a single solution to be considered valid; valid sudoku with as few as 17 given numbers are known, although this has not yet been proven to be the lower bound.

Following years of research and individual experiences with making deductions in sudoku, a number of strategies have been formalized for their solution [see, for example, Scanraid Sudoku solver; Stuart (2005-9), also Stuart (2007)]. Many of the basic strategies fit into the categories of either positive inference or negative inference depending on whether the presence, or absence, of particular numbers is used in the placement of further digits. Different solvers may vary in their preferences for or abilities with positive versus negative inference which presents a challenge for the "retesting" of a competitor, as in the recent Varshavsky case [Philadelphia Inquirer, 2009], as a high variance in solving times might be encountered by a particular solver if presented a random sudoku to solve in under 5 minutes. Additionally, the breadth of a solution path can also influence the solving time, with a smaller variance of times expected for a puzzle with many places to identify digits by logical means compared to those with fewer such places. Given the puzzle by puzzle variation seen in sudoku, single puzzle finals (and playoffs in general) are not viewed by many members of the advanced solving community as an appropriate way to select the best overall sudoku solver [see motris.livejournal.com].

An individual sudoku puzzle can be transformed into a large number of isomorphic grids using number reordering, row/column/chute swapping, and rotations/reflections. For this reason, while 6,670,903,752,021,072,936,960 possible solutions have been reported for sudoku, only 5,472,730,538 are considered "essentially different". [Russel, Jarvis, 2005 and 2006] It is speculated, but has not yet been demonstrated, that isomorphic puzzles can lead to highly variable solution times due to various esoteric tendencies of the solver. A solver who approaches sudoku in a cyclic 1-9 search order may be affected by reassigning the values of the givens. Similarly, a solver who starts from the top to bottom or from the left to the right could be affected by row/column/chute swapping as well as rotations or diagonal mirroring. However, despite these nuances of the solution process, the essential steps along the solution path will be unchanged in these isomorphic grids. If one had to retest a competitor, isomorphic forms of the puzzle might be valid for such a test. If the competitor had, for example, memorized the solution of a particular puzzle (even in an extreme case where he has memorized an order of digits to enter, perhaps from receiving a copy of the puzzle before the test and needing to reproduce a "solution path" while under increased scrutiny), an isomorphic version of the same puzzle will be protected from such an attack as it will be impossible to memorize the approximately one trillion isomorphic grids that are possible for any individual puzzle. Further, if a solver claims he is better at, say, negative inference - excelling at naked singles but not hidden singles for example - and tries to use this to explain a much slower time for a particular puzzle, the use of an isomorphic grid containing an identical number of potential starting points for a particular kind of logic rule will negate any "variation of solving method" rebuttal for a poor retest. (Of course, according to Thomas Snyder [personal communication], the most fascinating element of the Varshavsky case was that after 4 weeks of being under suspicion, the competitor actually showed up for a retest without any practice to make it remotely possible he could deceive the invigilators, even on identical puzzles to the ones under suspicion! We are therefore considering the case of catching smarter cheaters, with an altered ratio of intelligence to hubris, in this study).

Here, we have tested a set of puzzles randomly ordered following transformations to determine how consistent or inconsistent the solving process is. We also added a single puzzle with a particularly unusual solving property into the mix - and alerted the subject to its presence - to see if the step would be encountered on his solving path and identifiable amidst many other puzzles.

**Materials and Methods:**A perl script was written to output a specified number of isomorphic forms of an input classic 9x9 sudoku puzzle by performing number replacement, then row and horizontal chute permutations, then column and vertical chute permutations, and then possibly performing a rotation/reflection to reach the final altered state. 5 puzzles (3 of approximately 4 star difficulty, 2 of approximately 5 star difficulty, all appropriate to reward logical methods instead of bifurcation using the identification of pairs and triples as the hardest steps by Scanraid analysis) were selected for analysis and five versions of each were generated. A sixth puzzle with an unusual solving property, specifically an Order 5 Franciscan Rectangle that fits into a class of "uniqueness" techniques, was chosen and randomized as well. This advanced technique is not required to solve the puzzle, but is a potential shortcut an advanced solver might find and exploit. All puzzles had between 24 and 28 givens and the number of givens was not unique for any of the six puzzles.

The 26 puzzles (now labeled A-Z) were printed on 4.5" by 4.5" square grids, identical to the size used at the Sudoku National Championship, and then solved in batches of 5 (in one case 6) puzzles at a time with short 5 minute breaks. A recorded, pre-watched, episode of "Sportscenter" was playing in the background throughout the testing as a kind of white noise. The subject, selected for being a high-finishing participant in many sudoku championships, chose to solve the puzzles while seated in a comfy chair, writing against a notebook surface with .9 mm Pentel Twist-Erase III pencils. The subject timed himself on the solution of these puzzles, and the finished grids were checked for completeness of filling, but not checked for accuracy. From anecdotal evidence provided by the subject, about 95% of puzzles will be correctly solved in this manner so, in this study, approximately 1 puzzle may have an error such as a transposition.

In addition to the human testing, the puzzles were also scored in Scanraid Sudoku to determine how "computer rating" itself varies as a result of grid transformations.

The 26 puzzles are provided in Appendix B.

**Results:**The subject's initial solve data (and paper notes) follow:

A - 3'22", B - 2'39", C - 3'56", D - 3'57" (franciscan-like), E - 3'41", F - 2'40", G - 3'3", H - 1'48", I - 2'12", J - 2'2", K - 2'27", L - 3'31" (Order 5 Franciscan), M - 3'18", N - 3'1", O - 1'52", P - 4'21" (franciscan-like), Q - 1'39", R - 1'53", S - 3'16", T - 3'17" (franciscan-like), U - 3'41" (franciscan-like), V - 3'7", W - 2'53", X - 2'6", Y - 3'35", Z - 2'1"

The subject completed the 26 puzzles in a range of times from as short as 99 seconds to as long as 261 seconds. The average solving time was approximately 2 minutes, 54 seconds, with a standard deviation of 46 seconds. The median solve time was approximately 3 minutes, 2 seconds. During solving, the subject never had to make an erasure or correction to his puzzles, which would lead to a higher than expected time for his solving process on that individual grid. If such an event had occurred, we would have suggested a future retesting of the same puzzle as "clean" solving times are of the most interest to us for comparison.

The subject also made notes on some of his puzzles. Given the expectation of a Franciscan puzzle of order 5, the subject related that "when I saw 4 similar pairs that could make such a situation, but didn't resolve properly, I noted 'franciscan-like' at the end of the solution process." For 4 solutions (D, P, T, U) this text was indeed written in the margins of that test paper. For a fifth puzzle, L, Order 5 Franciscan was written. Puzzle L is indeed the sudoku with an Order 5 Franciscan Rectangle. D, P, T, U (and M) are all isomorphic forms of another puzzle. It is unclear why the subject did not mark M as Franciscan-like but it is within a second of the fastest time over the trials for this puzzle meaning the critical steps may have been identified before the relevant pair property was noted in a period of being "stuck". It also occurred in the trial immediately after the Order 5 Franciscan grid (L) and could have been missed due to increased laxity after identifying the hidden sixth puzzle.

After the testing, the subject also noted to experimenters that "deja vu" feelings about the starting steps of puzzles would happen such as "I've chased this kind of digit by pointing pairs around the grid before." While we did not ask the subject to try to isolate each individual puzzle, the solver indicated that if he noted the kinds of starting steps, as well as sticking points, he could possibly have accomplished this assignment with some accuracy. Because of this alleged "memory" of a puzzle, we did consider whether the solver got "better" at a particular puzzle, even with isomorphism, as the trials went on.

Rearranging the 26 puzzles into the 6 unique types, and ordering by apparent difficulty, these are the subject's times in chronological order:

1 - HJKQZ - 1'48", 2'2", 2'27", 1'39", 2'1" - average time 1'59", deviation 18", range 1'39"-2'27"

2 - BGOVX - 2'39", 3'3", 1'52", 3'7", 2'6" - average time 2'33", deviation 33", range 1'52"-3'3"

3 - EFIRY - 3'41", 2'40", 2'12", 1'53", 3'35" - average time 2'48", deviation 48", range 1'53"-3'41"

4 - ACNSW - 3'22", 3'56", 3'1", 3'16", 2'53" - average time 3'18", deviation 24", range 2'53"-3'56"

5 - DMPTU - 3'57", 3'18", 4'21", 3'17", 3'41" - average time 3'43", deviation 27", range 3'17"-4'21"

6 - L - 3'41"

The tightest solution times exist for the puzzle with the fastest solution, but the next two tightest distributions are for the two puzzles that respectively take the longest. When sorted by average solve time, both the fastest time on a puzzle, and the slowest time on a puzzle (the "range") are also in the indicated 1-5 order. Considering the relative position of the fastest versus the slowest solve, for these three puzzles the fastest solutions were #4, 3, 4, 5, and 4 (average of 4.0) and the slowest were #3, 4, 1, 2, 3 (average of 2.6) which suggests a slight possibility of some learning throughout the process although in a few cases, as on the puzzle with the highest variance (3), a very fast fourth solution was followed by a much slower fifth solution (by 102 seconds) that was only slight ahead of the overall slowest solution so a matter of a few seconds would change a 1 to a 5 and affect this averaging. Another hypothesis that accounts for the results is that, after solving several puzzles, the subject had "warmed up" and was more likely to achieve faster times.

Following a day off, the outliers - the 10 puzzles with the fastest and slowest respective times - were resolved in random order to determine if these differences were based on variations in the grids themselves, or by the stochastic nature of the subject's solution process. This test would also help determine if the "exact" puzzles have more consistent times than permuted grids.

Q - 1'39" first time, 1'47" second time, +8 seconds difference

K - 2'37" first time, 2'55" second time, +18 seconds difference

O - 1'52" first time, 2'40" second time, +48 seconds difference

V - 3'7" first time, 2'14" second time, -53 seconds difference

R - 1'53" first time, 2'9" second time, +16 seconds difference

E - 3'41" first time, 2'40" second time, -61 seconds difference

W - 2'43" first time, 2'43" second time, 0 seconds difference

C - 3'56" first time, 2'23" second time, -93 seconds difference

T - 3'17" first time, 3'38" second time, +21 seconds difference

P - 4'21" first time, 2'45" second time, -96 seconds difference

In 8 of the cases, the second time on a puzzle moved towards the mean by being slower for the fastest puzzle or faster for the slowest puzzle. In one case the exact time reoccurred; in the last case, a slightly slower time was observed for the slowest time. In 3 of the 5 pairs of puzzles, the slower puzzle in the first trial was solved faster than its partner in the second iteration. In 2 cases, for C and P, the worst solution time actually led to a new fastest solving time. Collectively, these results suggest that the outliers were not perceived as easier/harder by the solver consistently but were rather the result of random variation in the subject's search process from puzzle to puzzle. Given the magnitude of the time differences in the retest, we also conclude that the variation between "exact" duplicates of a puzzle is close to identical to the variation from transforming the grid. The day off effectively eliminated memory of an individual puzzle so that it behaved as an isomorphic grid. The value of using isomorphic grids is that an identical puzzle can be tested almost immediately after another one whereas we do not expect back-to-back solutions of an exact duplicate would behave in the same way.

The 26 puzzles were then evaluated by Scanraid Sudoku to determine their "grades".

1 - HJKQZ - 74, 75, 73, 74, 75 - average 74.2, deviation .8

2 - BGOVX - 92, 102, 113, 108, 94 - average 101.8, deviation 9.0

3 - EFIRY - 154, 104, 104, 109, 150 - average 124.2, deviation 25.5

4 - ACNSW - 165, 132, 147, 156, 137 - average 147.4, deviation 13.5

5 - DMPTU - 139, 139, 147, 130, 143 - average 139.6, deviation 6.3

6 - L - 96

Scanraid's score agrees on the "easiest" puzzles by rating, although it differs on the order of 4 and 5. 6 (the order 5 Franciscan rectangle) takes about as long in its single solve as the average time of the hardest puzzle, but scores much less by Scanraid. Intriguingly, the rating of some of the isomorphic grids, particularly E and Y, compared to F, I, and R, was grossly different. The subject's solving of the puzzles E and Y was also much slower than those for FIR for this highest variance case, well beyond expectation, which drew our interest.

To account for the variation observed between the isomorphic puzzles E and F, we first checked our software to confirm the puzzles were isomorphic and indeed they were. We then considered the relative Scanraid solving path for E versus F as comparison.

E displays two instances of Singles in Row/Col, then two instances of Singles in Box, then one instance of Naked Pairs/Triples, then sixteen consecutive uses of show possibles (one choice cells) solves the rest.

F displays two instances of Singles in Row/Col, then one instance of Singles in Box, then one instance of Naked Pairs/Triples, then thirteen consecutive uses of show possibles solves the rest.

The first major difference in E and F occurs in the Singles in Box step, with an extra instance of this occurring for E but with the same total number of digits (2) being placed after this step concludes. In the specific case of E, a single (7) then allows another single (6) in the same box and this is done in two steps. For F, the isomorphic position in just one step uses a single (8) to then allow another single (9). Digit order therefore seems to be hard-coded into Scanraid's approach of step counting. Swapping 6 and 7 in E does not change the overall rating from 154, but does reduce the number of iterations of Singles in Box from two to one to be more F-like.

The next major difference in the two solution paths is in the Naked Pairs/Triples step. Unfortunately, hidden pairs and triples are more common than naked pairs and triples for most solvers (and are identified one at a time, not en masse), but Scanraid identifies a slew of naked pairs/triples in a single step. In E, 1 pair (eliminating 9 candidates) and 6 triples (eliminating 16 candidates) are identified. In F, 1 pair (eliminating 9 candidates) and 7 triples (eliminating 17 candidates) are identified. The extra triple (and extra missing candidate) change the solving profile slightly at the end with 3 fewer "one option only" steps needed. This faster solution path must account for the relative difference in score.

To further examine how different transforms affect the Scanraid rating, distinct transformation types were generated and rescored for the grids E and F. Varying just the number identity of either E or F led to only small changes in the score (across five grids E varied from 151 to 163 and F varied from 104 to 113). Varying the row/horizontal chute ordering had a much larger effect on the score (E varied from 105 to 154 over five grids; F varied from 104 to 151). This shows that geometry (perhaps in top/bottom or left/right order) is also a meaningful difference in the scoring of puzzles by scanraid. Taking a mirror (across the main upper-left to lower-right diagonal) of E turns a 154 rating into a 108 rating. This transformation is minor so the large change in rating seems astounding and speaks further to top/bottom or left/right bias. Similarly, just rotating E 90 degrees counter-clockwise turns a 154 rating into a 108 rating. Since the numbers are just symbols and an 8 can be seen as an infinity symbol if looked at sideways, this is not a meaningful transformation at all but it accounts for the entire magnitude of the difference in rating between E and F. The 90 degree rotation also changes the 1 pair and 6 triples for 25 eliminated candidates step of E into the 1 pair and 7 triples for 26 eliminated candidates step seen for F.

**Discussion:**While Scanraid does offer some value in rating the fastest versus the slowest puzzles for this subject, we do not believe, within isomorphisms of a given puzzle, that the variation in Scanraid scores is meaningfully correlated with the experience of the tested subject. We posit that isomorphic grids should receive the same/similar scores from an ideal computer rater, but minor effects including digit order and top/bottom or left/right bias in this particular algorithm lead to large differences in rating (matching the untested hypothesis that a solver that goes in 1-9 order may find numbers at a different rate). An ideal rating system may require both a reordering of steps, as well as a consideration of solving path breadth. While Stuart has not divulged all the methodologies of his system [Stuart, 2007], his scoring has been evaluated using thousands of self-reported solving times in 5 or 10 minute wide ranges. As this study focuses on an "advanced" solver, these times would not be relevant for marking difficulty for this solver as the smallest reported time (< 5 minutes) accounts for all of the solutions here. It is an open question whether "beginner" or "intermediate" solvers are making use of the same techniques as an "advanced" solver, but with less efficiency and a constant scaling factor, or if their techniques are less developed and have their properties. A further consideration for this Scanraid scoring is that, if the self-reporting solvers are using parts of the Scanraid engine to solve the puzzles, which can present all candidates to a solver in its web-interface and printouts, the difficulty of naked singles (and pairs and triples) will be biased for these times relative to standard paper and pencil solving where solvers do not begin with candidate information and this information does not get updated during the solution process.

In contrast to Scanraid's approach, the given subject's style (see appendix A for a description of his heuristic) seems more tuned to individual occurrences of digits and geometry, and is therefore blind to most of these changes. The "deja vu" feel of certain puzzles at the start or end, as well as the identification of 4 puzzles out of 5 that felt close to the Franciscan Order 5 at some stage, suggests that many commonalities in solving path are found despite the highly transformed grids. Still, in any single trial, slight differences in searching pattern, particularly on the puzzles with the narrowest solving path, can lead to variations in times. On average, a standard deviation of about 15-20% in solve times seems likely on any single puzzle at these difficulties, although in the extreme a 2x time might be observed. At these difficulties, similar to those used by the Sudoku National Championship, all puzzles are being solved in less than 4 and a half minutes. Retesting with transformed grids should not lead to extremely high variations in round time.

For this finisher of three puzzles in 9-12 minutes, being retested and taking more than 16 minutes is not expected from this study. Consider if puzzles 3-5 constituted a round of the competition. The "best possible day" for the subject, taking his fastest times across all iterations, would allow for the round to be finished in 8 minutes, 3 seconds. The mean times for the subject would lead to completing the round in 9 minutes, 49 seconds. The "worst possible day" for the subject would leave a round time of 11 minutes, 58 seconds. The gap of "best" to "worst" day is less than a 50% increase of the fastest times. The gap of "best" to "mean" is less than 20%. Of course, making an error while solving a sudoku puzzle which requires full erasure of a grid would lead to much worse times than any of the instances observed here. Such errors occur all the time, but for identifying what a solver can do these "scrub" times should not be used.

In the Varshavsky case, having observed a single data point of finishing 3 specific puzzles in 14 minutes, we can now begin to consider what upper limit of solving time would be consistent with a legitimate competitor. Consider retesting the suspect on some number of identical (meaning isomorphic) puzzles from one to one trillion for the three round puzzles (~3 of each seems reasonable). These trials would establish both a "mean" and "best" time for a solver. Certainly, if the combined "mean" over three puzzles came within a minute of or even bested the observed time, this would remove suspicion from this solver. A "best" time that does not approach even a reasonable "mean" time, such as one 50% higher than the competition time or 20 minutes, would instead suggest cheating had occurred. Without actual proof of a method of cheating, organizers could allow the suspect to continue solving the 1 trillion possible puzzles until his "best" time could beat the 14 minute time, but by then his "mean" time and variance will be very well established allowing a less biased estimate of the probability that a legitimate solving approach was used during the competition.

In performing retesting with different solvers, it is likely that certain solving styles, particularly those with a high level of bifurcation/guessing, will have a particularly high variance in solving time. This study does not contain enough participants to report on such cases. Still, consider an extreme case in which a very difficult puzzle has an obvious bifurcation spot which will lead to a solution or not. The times for this puzzle will be bimodal, say 4 or 6 minutes per puzzle, tied to a coin-flip. A lucky run of three heads would lead to a 12 minute round while an unlucky performance would lead to an 18 minute round. Neither the "mean" time in this situation, nor the "best" time, would eventually come under scrutiny, particularly as it will be obvious by direct observation that the solver employs bifurcation, has high variance in his times, and may therefore just need a 1 in 8 chance of being lucky in each of three puzzles to qualify for the stage. Indeed, direct observation of the solver during retesting, to classify his solving heuristics in addition to assessing his speed, will clarify the results of invigilation.

All considered, whatever method Varshavsky used in Philadelphia to accomplish a 14 minute time in round 3 was not resistant to discovery after retesting, although without his failure to perform even adequately in the finals he would have escaped undetected at this championship under current protocol. Following retesting, where finishing even a single puzzle in 30 minutes was unlikely, it was not a hard decision to disqualify. Consider instead the more difficult case of a competitor who, on significant retesting, can only achieve a best time of 20 minutes, with ~6.5 minutes being the fastest he can solve any puzzle with an average of 8.5 minutes per and 1 minute standard deviation. One possible cheating strategy, relatively low-tech but still well-known in many testing situations, is to situate oneself near a more talented solver. Given the two-page booklet used in these championships, and the fact most solvers start on puzzle one and then just slide the booklet over to solve the second puzzle, imagine a cheater who solves puzzle two, then copies puzzle one from his neighbor to the right (taking about one minute which is a reasonable time for copying ~60 digits into a grid), and then solves puzzle three. This solver with a 6.5 minute "best" time per puzzle could then accomplish a 14 minute round by cheating, but under retesting could never come "close" in a probabilistic sense. A sufficiently stringent retesting procedure should prove the cheating even in this difficult case where a majority of the solving of a round was legitimate, but slower than other competitors. Ideally organizers would situate proctors and physical obstructions between competitors to reduce the opportunity for such an approach (or even a lesser variant of the approach where, after solving puzzle 3, a quick glance around after a "look at the clock" would allow for the copying of a handful of digits into both puzzles 1 and 2 to get a head-start on each).

In a world with many potential "Eugenes", with valuable prizes, and with increasingly miniaturized technologies that can assist cheating, it is reassuring to know that a relevant retesting approach may exist to identify cheaters if subjected to such testing. Particularly proactive organizers may make a mandatory "retest" a requirement for a podium finisher, similar to a drug test for a winning athlete. It is further reassuring to consider that basic steps such as handing out randomly permuted forms of the same puzzles to neighboring competitors could be used to remove/minimize the potential of the most common forms of cheating including prior knowledge of the test (there is no longer one test to memorize) and copying off a neighbor for sudoku. It is unlikely such steps are possible with almost any other kind of puzzle, including other logic puzzles or crossword puzzles.

While this study may seem like a rather extensive analysis for such a basic and fundamental result on puzzle-to-puzzle variation in solving times, it is the first study of its kind and should also serve as a good foundation for further studies that use isomorphic puzzles to explore the solving process. For example, differences in time for using "notes" versus "pencilmark-less" solving, differences arising from grid size both on paper and on whiteboards, and differences arising from the use of "reckless bifurcation" versus logical methods on puzzles of varying difficulties, should all be fruitful areas of further research. We will pursue these areas, and others, as our time and research budget allow.

**Appendix A:**The subject defined his general heuristic in this way:

Modes:

A. Randomly search, using chute-scanning, digit repetition, or geometry-based visual cues likely identified from memory due to pattern recognition, for the placement of either a number or an either/or notation for a single digit in a single region. Once a spot to write something is found, this is pushed onto a mental "stack" of information while the physical writing is occurring.

B. When there is a stack, consider the top-most position of the stack, searching for consequences in the row/column/box where the progress was made. Consequences include positive inferences about the digit on the stack, as well as negative inferences about other digits in the row/column/box. If a new spot to write a digit or note is found, this is pushed onto the stack while the writing is occurring.

Additional Considerations:

1. If, while in B mode, no consequences are found for the top element in the stack, that element is popped off the stack and solving continues in B or A mode depending on the presence of any other digits/notes in the stack.

2. When returning to A mode, some effort is made to not perform redundant searches given a cache of what has been searched in the past.

3. If in A for a sufficient length of time n (n = 30 to 180 seconds), enter a bifurcation at a spot with a high density of notes to return to B mode. The value of n depends on the expected difficulty of the puzzle, and is often correlated to the "trust" the subject has in the organizers of a competition.

While this heuristic is powerful, the random nature of the A mode - which can also be described as "what catches his eye first" - is likely to vary from iteration to iteration on a single puzzle as observed in the data. Most of the observed differences in time is likely from total A mode time, and not total B mode time, although it is hard to accurately tally either value for this subject.

**Appendix B:**Puzzles:

A - 0005060170009010360000000000002000900590

B - 0008000900000357060000218050926000708000

C - 0005090000038014000600000026300000455700

D - 0008030490002000107400000000004050610006

E - 0002008000005004000000030070004706209070

F - 7000300000805000000501000000189040000040

G - 0004065090005018020000900303060800200903

H - 9004108000108500090020000100060000501009

I - 4000200610000009007600380008000600530000

J - 0907000587000080960010500003209060004703

K - 8000030590307000000090800360000000006010

L - 8040930000725010000308000200806000903060

M - 0006040000060000088050000630009020000400

N - 0000000370700500966008000000200600138002

O - 0010700040409025000008010000300007005040

P - 0400700500001030000800900604000003005070

Q - 1070000320000000002030000980400006000084

R - 6009000005002000000100300008026010000030

S - 0000000006084070003026010000039000000000

T - 0005078030740000000004000065000100208000

U - 0000079000000241088900000000027000409800

V - 0000010500007203090005908049000705000430

W - 0706080206000005000007030000000000009030

X - 0000000455000078000703000267030950006020

Y - 0000058000006000040000035000000973800010

Z - 0000050609070800036039008000000080201080

**References:**

Crook, J. F. "A Pencil-and-Paper Algorithm for Solving Sudoku Puzzles". Notices of the AMS (56, No. 4), 460-468 (2009). Link

Russel, E. and Jarvis, F. "Mathematics of Sudoku II", self-published article, (2006) Link

Also see "There are 5472730538 essentially different Sudoku grids ... and the Sudoku symmetry group" - web-published article (2005) - Link and other enumerations by Jarvis and coworkers.

Stuart, A. C. "Sudoku Creation and Grading" - self-published article, (2007) Link

Stuart, A. C. "The Logic of Sudoku" - Michael Mepham Publishing (2007). Link

Stuart, A. C. Scanraid Sudoku Solving Engine (2005-9) Link

*Articles on the Varshavsky Case from motris.livejournal.com:*

The Cat in the Hat Strikes Back? (an Open Letter to the Sudoku community and the organizers of the Sudoku National Championship about the potential cheating of Eugene Varshavsky during this Saturday's tournament) (10-25-2009) Link

Some Picture Evidence for the Sudobomber (10-26-1009) Link

The Sudobomber story develops... (10-27-2009) Link

*Articles on the Varshavsky Case from the Philadelphia Inquirer:*

"Sudoku prizes frozen in cheating probe" (10-28-2009) by John Timpane Link

"Sudoku contest competitor to be retested" (11-11-2009) by John Timpane Link

"3d-place winner disqualified in Sudoku Scandal" (11-24-2009) by John Timpane Link

11 comments | Leave a comment