Problem 44383. Code breaker, Part III: Operation Xiangliu
You have been tasked with decoding several batches of coded messages that have been intercepted.
Based on previous intelligence that has been gathered, you can be confident that the messages were all encoded using a simple Caesar cipher (a type of substitution cipher), an example of which is the ROT13 cipher (discussed in Cody Challenge Problem 78). As in the Cody Challenge Problem, uppercase and lowercase letters are handled independently of one another, and all other characters (e.g. punctuation, numbers, accented letters) are unchanged. Unlike the Cody Challenge Problem, here the number of letters to shift is not known in advance, and will vary between (not within) batches — also, here you need to decode, not encode.
These latest activities that you are investigating have been nicknamed 'Operation Xiangliu' by your own organisation. A few decoding options are at your organisation's disposal:
- Test the candidate decodings against all words in a large dictionary — this could work, but it is very slow.
- Test the candidate decodings for the appearance of a name or phrase that is certain to appear regularly (e.g. "Operation Phoenix") — unfortunately as yet there is not enough knowledge of the activities to be sure of a suitable name or phrase to use.
- Test the candidate decodings against all words in a short list comprising, say, the hundred most frequently used English words.
The third option will be faster than the first option, and more reliable than the second option, so this is the approach you will be taking.
You have been careful to avoid using 'lemmatised' lists (which contain e.g. the verb "to be", rather than the various inflected forms such as "am", "is", "are") like those based on the OEC or COCA, and after setting aside another list you finally choose the list based on the BNC as the most reliable, and will use the first 100 words on that list. This list will be available for you to access as an input variable, bncWordlist, to your function. Note: (i) some entries on the list are morphemes (e.g. " n't " and " 's ") rather than words; (ii) some entries appear more than once (representing different grammatical word classes). Of course, in the original messages any capitalisation might be used.
You have been instructed that your 'certitude' (degree of confidence) in the decoding must be reported for each batch, and shall depend proportionally upon the sum of the characters for the words/morphemes in the decoded message that match words/morphemes in bncWordlist. Being able to match words/morphemes accounting for three-tenths of the characters (excluding spaces) shall correspond to 100% certitude; matching three-twentieths would be 50% certitude, and so on. Certitude shall be reported as a percentage, rounded up to the nearest integer, not greater than 100. You need to maximise your certitude for each batch by appropriate choice of the shifting parameter. If there are multiple shifts of equal certitude, report both options on different rows (arranged in order of ascending shift parameter).
Your task is to crack the codes and report back in a structure array: (1) the shifting parameter that had been used in the encoding [as uint8] (usually scalar, but may be column vector); (2) the decoded messages [as a cell array (containing character arrays)] (usually an array with a single row, but occasionally with multiple rows); (3) your 'certitude' in the decoding [as uint8] (always scalar). The name of the structure array shall be s, with respective fields shift, message, and certitude.
EXAMPLE 1
Suppose the batch contained two encoded messages — "Vomftt qvstvfe, pqfo op eppst." and "Ffmt dbo ljmm, opu pomz xpvoe." (provided as character arrays within a cell array) — and a (right-shifting) ROT1 cipher had been applied. In that case A→B, B→C, ..., Y→Z, and Z→A; similarly, a→b, b→c, ..., y→z, and z→a. Thus the original messages would have been: "Unless pursued, open no doors." and "Eels can kill, not only wound." . Twelve of the 51 characters have been matched: "no", "can", "not", and "only".
The correct answer would therefore comprise:
s.shift = uint8(1) s.message = {'Unless pursued, open no doors.', 'Eels can kill, not only wound.'} s.certitude = uint8(79)
EXAMPLE 2
Suppose the batch contained one encoded message — "Oa oqvvq'u cnycau dggp: "Ctu itcvkc ctvku"." (provided as a character array within a cell array) — and a (right-shifting) ROT2 cipher had been applied. In that case A→C, B→D, ..., Y→A, and Z→B; similarly, a→c, b→d, ..., y→a, and z→b. Thus the original message would have been: "My motto's always been: "Ars gratia artis"." . Eight of the 37 characters have been matched: "My", " 's ", and "been". Note carefully that: " 's " should only be matched once; "to" (in motto), "be" (in been), "at" (in gratia), "is" (in artis) and "a" (passim) should not be matched at all; and " ' " will only ever be used as an apostrophe (never as a quotation mark).
The correct answer would therefore comprise:
s.shift = uint8(2) s.message = {'My motto''s always been: "Ars gratia artis".'} s.certitude = uint8(73)
If the batch of messages cannot be decoded when following the above assumptions, then return the original encoded message(s) unchanged, with a shift of zero and a certitude of zero.
Note: Many Cody solutions are hard to read and not necessarily computationally efficient. To direct attention toward efficient runtime speed of execution, timings are measured in the Test Suite, and reported back (if the submission is runnable). Although the timings are not perfectly reproducible, they do provide an indication of computational resource demand. (Try resubmitting on another day if your code runs slowly, in case it is caused by a server issue.)
To provide some extra motivation for you to get your code to run efficiently, it will fail the Test Suite if it is deemed "too slow".
----------
Previous problem: Operation Orthos. Next problem: TBA.
Solution Stats
Problem Comments
-
6 Comments
I don't understand the computing of the certitude parameter.
In Example1: ...Twelve of the 51 characters have been matched.. Why 51 characters ?
In Example2:... Eight of the 37 characters have been matched: "My", " 's ", and "been". 7 not 8 characters ?
Could you be more specific ?
Thank you for these great problems !
Hello, Jean-Marie Sainthillier. Thanks for your feedback. Regarding Ex. 1, the key information is given in a preceding paragraph about the general calculation of 'certitude', which states: "[...] accounting for [...] of the characters (excluding spaces)". I excluded spaces because there are none in the 'dictionary', so they would skew the results. E.g. if the phrase "They didn't understand" were padded with one hundred spaces before encoding, then we would always get a low 'certitude' if spaces weren't excluded (viz. 28%), suggesting only low–moderate confidence in the decryption, despite decoding half of the matchable characters, which should have given us high confidence (viz. 100%) given the small size of our 'dictionary'. Arguably punctuation could have been excluded too, but I opted to _include_ punctuation, not least because apostrophes do occur in the 'dictionary'. That is why there are 8 matched characters in Ex. 2: the apostrophe is counted (in _'s_). In a bigger 'dictionary', hyphens and full stops could also occur. Some punctuation, such as commas and semicolons, would perhaps never occur in any practical 'dictionary', but their effect on the 'certitude' is relatively small, and segregating punctuation according to function is beyond the present scope!! —DIV
Thank you :)
I would like to suggest that when timing code for exercises that you only measure the function implemented by the user. In this case, your test suite code should be:
for i = 1 : qBig
% EDIT (2018-06-17). Ensure each case is unique.
characters = [' , .' char(randi([97,122], [1,23]))];
x(1+i) = {characters( randperm(numel(characters)) )};
% END EDIT (2018-06-17)
end;
t0 = clock;
t0_cpu = cputime;
% Loop
for i = 1 : qBig
y(1) = x(1);
y(2) = x(1+i);
solution = decode(y, bncWordlist);
end;
% Compute and display elapsed time.
dt = etime(clock, t0);
dt_cpu = (cputime - t0_cpu);
for dummy = 1 : 10, disp(' . '); end;
fprintf('Your wall time to decode %u message batches = %2.2f seconds.\n\r', qBig, dt)
Finally, I am not sure that cputime or clock are indeed good tools for measuring efficient code (this is not their purpose). Timeit is much more adequate since it runs our code several times and uses the median for removing noise. The function cputime measures the amount of work that the CPU is currently doing, which may include everything that is running on a MATLAB session.
This problem is way too hard. It is not enough to solve the test cases or do vectorized code, we must find the right combination of MATLAB functions that achieve the greatest speed, which is boring. This means we have to write many times the same code until the right combination of functions becomes obvious (I had to create 6 different codes or more, until I finally got a valid solution). Moreover it is not clear why the number 8 (seconds) is somehow a fair measurement of timing for this task since we must try all combinations and deal with strings. Operation Phoenix and Orthos were nice, good job. They had a limit of 3 seconds, but we didn't have to look up words in a dictionary (adding just 5 more seconds for searching in a dicionary of 100 words that must be used at every iteration doesn't seem fair). Even after converting the dictionary words to hash keys, I was able to barely pass the test suite.
Hello, Rafael S. T. Vieira.
Firstly, thank-you for your constructive feedback.
When I wrote the Test Suite — a few years ago — I explored a few options for the timing, including use of the timeit() function. I believe that there was a specific rationale for the choice I ultimately made, although I can't confidently recall all the details off the top of my head.
As for the implementation of the timing, I concede that currently it includes some overhead from generating some random strings. However, I doubt that the overhead is 'significant' — I would be *guessing* that it would account for less than 1% of the total time elapsed. So while your proposed amendment could be a nice improvement, it's probably best left as a tip for future Problems: I prefer not to change an existing Test Suite unless there is a compelling reason.
Next, let me share with you the basis for the 8-second benchmark. I wrote an unoptimised reference solution and added a small safety margin. My hypothesis was that if my unoptimised code could satisfy the Test, then most likely other (experienced) Cody players could also pass the Test if they avoid slow operations. When the Problem was published, indeed the Cody players who have solved it were able to pass the test using a variety of approaches. Indeed, some advanced Cody players were able to highly optimise their code and deliver impressively quick times.
You may notice that I built in an allowance for increasing computational power. This allowance also had a safety margin. In other words, the increased number of iterations is somewhat less than the projected increase (from historical data) that would maintain equity. Of course, even predictions based on data can be wrong, so I have checked my reference solution again (twice).
I got the following:
"Your wall time to decode 2370 message batches = 7.99 seconds."
"Your wall time to decode 2370 message batches = 7.62 seconds."
In both cases it passed. The first instance was amusingly close to the limit, but I mention again that this code was not optimised yet it still passed. I can assure you that it has no "hashes".
Finally, the subtext is that this is intended to be a difficult Problem: a challenge for experienced Cody players. Some will like that, and will even enjoy challenging themselves to optimise their own code. Others won't be interested, and there are thousands of other Problems on Cody for them to solve.
—DIV
Solution Comments
Show commentsProblem Recent Solvers11
Suggested Problems
-
Back to basics 3 - Temp Directory
368 Solvers
-
Sum of diagonal of a square matrix
1575 Solvers
-
409 Solvers
-
Rotate and display numbered tile
344 Solvers
-
MATCH THE STRINGS (2 CHAR) very easy
286 Solvers
More from this Author32
Problem Tags
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!