\myheading{SAT solver on top of regex matcher}
\label{regex_SAT}

\leveldown{}

\renewcommand{\CURPATH}{other/regex_SAT}
  
A SAT problem is an NP-problem, while regex matching is not.
However, a quite popular regex \textit{backreferences} extension extends regex matching to a (hard) NP-problem.
Backreferences usually denoted as \verb|\1|, \verb|\2|, etc.

Perhaps, the most practical use of backreferences I've heard
is HTML tag matching\footnote{\url{https://www.regular-expressions.info/backref.html}}

\begin{lstlisting}[caption=This regex isn't escaped properly]
<(.*)> ... </\1>
\end{lstlisting}

To match successfully, the second group must coincide with the first, like \verb|<b>test</b>|, but not \verb|<a>test</b>|.
Another practical usage I've heard: match \verb|"string"| or \verb|'string'|, but not \verb|"string'|.

Also, you can find two longest repeating substrings in an input string
\footnote{\url{https://stackoverflow.com/questions/9177647/regular-epxressions-that-matches-the-longest-repeating-sequence}
\url{https://stackoverflow.com/questions/44465284/match-longest-repeating-sequence-that-is-not-made-of-a-repeating-sequence}}.

Some other uses are very arcane:
divide input number by $\sqrt{2}$\footnote{\url{https://codegolf.stackexchange.com/questions/198427/shift-right-by-half-a-bit/198428\#198428}},
detecting factorial number\footnote{\url{https://codegolf.stackexchange.com/questions/121731/is-this-number-a-factorial/178979\#178979}}.

This text has been inspired by 
``Reduction of 3-CNF-SAT to Perl Regular Expression Matching''\footnote{\url{https://perl.plover.com/NPC/NPC-3SAT.html}},
please read it first.
However the author incorrectly states that only 3SAT problems are solvable. In fact, any SAT instance is solvable,
consisting of clauses of arbitrary sizes.

Also, since my SAT/CNF instances usually has more variables than 9 and I can't use backreferences like \verb|\1 ... \9|, I use
different method (Python): 

%\begin{framed}
%\begin{quotation}
\begin{lstlisting}
The syntax for backreferences in an expression such as (...)\1 refers to the number of the group. There’s naturally a variant that uses the group name instead of the number. This is another Python extension: (?P=name) indicates that the contents of the group called name should again be matched at the current point. The regular expression for finding doubled words, \b(\w+)\s+\1\b can also be written as \b(?P<word>\w+)\s+(?P=word)\b
\end{lstlisting}
%\end{quotation}
%\end{framed}
( \url{https://docs.python.org/3/howto/regex.html} )

See also: \url{https://www.regular-expressions.info/replacebackref.html}.

Now let's take this small CNF instance:

\begin{lstlisting}
p cnf 5 11
-2 -5 0
-2 -4 0
-4 -5 0
-2 -3 0
-1 -4 0
-1 -5 0
-1 -2 0
-1 -3 0
-3 -4 0
-3 -5 0
1 2 3 4 5 0
\end{lstlisting}

% TODO \ref{}
This is what I call \emph{popcnt1}: only 1 variable must be true, all the rest are always false.

\begin{lstlisting}
% picosat --all popcnt1.cnf
  
s SATISFIABLE
v -1 -2 -3 -4 5 0
s SATISFIABLE
v -1 -2 -3 4 -5 0
s SATISFIABLE
v -1 -2 3 -4 -5 0
s SATISFIABLE
v -1 2 -3 -4 -5 0
s SATISFIABLE
v 1 -2 -3 -4 -5 0
s SOLUTIONS 5
\end{lstlisting}

Now let's translate it to regex:

\begin{lstlisting}
string=xxxxx;x,x,x,x,x,x,x,x,x,x,x,
pattern=^(?P<a_1>x?)(?P<a_2>x?)(?P<a_3>x?)(?P<a_4>x?)(?P<a_5>x?).*;(?:(?P=a_2)x|(?P=a_5)x),(?:(?P=a_2)x|(?P=a_4)x),(?:(?P=a_4)x|(?P=a_5)x),(?:(?P=a_2)x|(?P=a_3)x),(?:(?P=a_1)x|(?P=a_4)x),(?:(?P=a_1)x|(?P=a_5)x),(?:(?P=a_1)x|(?P=a_2)x),(?:(?P=a_1)x|(?P=a_3)x),(?:(?P=a_3)x|(?P=a_4)x),(?:(?P=a_3)x|(?P=a_5)x),(?:(?P=a_1)|(?P=a_2)|(?P=a_3)|(?P=a_4)|(?P=a_5)),
SAT
1 -2 -3 -4 -5
\end{lstlisting}

You see: an input string has as many x's as many variables the CNF instance has.
Then, as many \verb|x,| substrings, as many clauses the instance has.

Take the first part of the pattern: \verb|^(?P<a_1>x?)(?P<a_2>x?)(?P<a_3>x?)(?P<a_4>x?)(?P<a_5>x?).*; ...|

Roughly, it means that the group can match \verb|x|, or may not. Let matcher decide on its own.
The second part of the pattern are clauses.
For the \verb|1 2 3 4 5| clause we have \verb$(?:(?P=a_1)|(?P=a_2)|(?P=a_3)|(?P=a_4)|(?P=a_5)),$ there.
That means, that the whole group \emph{must} match \verb|x|, but we don't tell how: one of 5 subgroups may match,
and each subgroup is actually backreference to the first part of pattern.
But these terms are all positive.
What about negative terms?

For the \verb|-2 -5| clause, we have \verb$(?:(?P=a_2)x|(?P=a_5)x),$ there.
That means that the whole group \emph{must} match \verb|x|, and again, we don't tell how,
but both backreferences \verb|a_2| and \verb|a_5| are prohibited, if present simultaneously.
It's OK for \verb|a_2| to match \verb|x|, but then the second part of the union would match.
It's OK for \verb|a_5| to match \verb|x|, and then the first part of the union would match.
Also, it's OK for both \verb|a_2| and \verb|a_5| match nothing: either part of union will match \verb|x| then.

If regex can't match, this means the input instance is unsatisfiable.

Likewise, I can run my \emph{popcnt4} instance, which commands to be 4 of input 8 variables be true:

\begin{lstlisting}
string=xxxxxxxx;x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,
pattern=^(?P<a_1>x?)(?P<a_2>x?)(?P<a_3>x?)(?P<a_4>x?)(?P<a_5>x?)(?P<a_6>x?)(?P<a_7>x?)(?P<a_8>x?).*;(?:(?P=a_1)x|(?P=a_2)x|(?P=a_3)x|(?P=a_4)x|(?P=a_5)x),(?:(?P=a_1)x|(?P=a_2)x|(?P=a_3)x|(?P=a_4)x|(?P=a_6)x),(?:(?P=a_1)x|(?P=a_2)x|(?P=a_3)x|(?P=a_4)x|(?P=a_7)x),(?:(?P=a_1)x|(?P=a_2)x|(?P=a_3)x|(?P=a_4)x|(?P=a_8)x),(?:(?P=a_1)x|(?P=a_2)x|(?P=a_3)x|(?P=a_5)x|(?P=a_6)x),(?:(?P=a_1)x|(?P=a_2)x|(?P=a_3)x|(?P=a_5)x|(?P=

...

P=a_7)|(?P=a_8)),(?:(?P=a_3)|(?P=a_5)|(?P=a_6)|(?P=a_7)|(?P=a_8)),(?:(?P=a_4)x|(?P=a_5)x|(?P=a_6)x|(?P=a_7)x|(?P=a_8)x),(?:(?P=a_4)|(?P=a_5)|(?P=a_6)|(?P=a_7)|(?P=a_8)),
SAT
1 2 3 4 -5 -6 -7 -8
\end{lstlisting}

Also, I managed to solve the \emph{Fred puzzle}\ref{FredPuzzle} in SAT/CNF form:

\begin{lstlisting}
string=xxxxxxxxxxxxxxxxxxxxxxxxxxxx;x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,
pattern=^(?P<a_1>x?)(?P<a_2>x?)(?P<a_3>x?)(?P<a_4>x?)(?P<a_5>x?)(?P<a_6>x?)(?P<a_7>x?)(?P<a_8>x?)(?P<a_9>x?)(?P<a_10>x?)(?P<a_11>x?)(?P<a_12>x?)(?P<a_13>x?)(?P<a_14>x?)(?P<a_15>x?)(?P<a_16>x?)(?P<a_17>x?)(?P<a_18>x?)(?P<a_19>x?)(?P<a_20>x?)(?P<a_21>x?)(?P<a_22>x?)(?P<a_23>x?)(?P<a_24>x?)(?P<a_25>x?)(?P<a_26>x?)(?P<a_27>x?)(?P<a_28>x?).*;(?:(?P=a_1)x),(?:(?P=a_2)),(?:(?P=a_3)x|(?P=a_4)x|(?P=a_9)x),(?:(?P=a_3)|(?P=a_4)|(?P=a_9)x),(?:(?P=a_3)|(?P=a_4)x|(?P=a_9)),(?:(?P=a_3)x|(?P=a_4)|(?P=a_9)),(?:(?P=a_10)x|(?P=a_9)x),(?:(?P=a_10)|(?P=a_9)),(?:(?P=a_11)x|(?P=a_10)x),(?:(?P=a_11)|(?P=a_10)),(?:(?P=a_11)),(?:(?P=a_5)x|(?P=a_6)x|(?P=a_12)x),(?:(?P=a_5)|(?P=a_6)|(?P=a_12)x),(?:(?P=a_5)|(?P=a_6)x|(?P=a_12)),(?:(?P=a_5)x|(?P=a_6)|(?P=a_12)),(?:(?P=a_13)x|(?P=a_12)x),(?:(?P=a_13)|(?P=a_12)),(?:(?P=a_14)x|(?P=a_13)x),(?:(?P=a_14)|(?P=a_13)),(?:(?P=a_14)),(?:(?P=a_7)x|(?P=a_8)x|(?P=a_15)x),(?:(?P=a_7)|(?P=a_8)|(?P=a_15)x),(?:(?P=a_7)|(?P=a_8)x|(?P=a_15)),(?:(?P=a_7)x|(?P=a_8)|(?P=a_15)),(?:(?P=a_16)x|(?P=a_15)x),(?:(?P=a_16)|(?P=a_15)),(?:(?P=a_17)x|(?P=a_16)x),(?:(?P=a_17)|(?P=a_16)),(?:(?P=a_17)),(?:(?P=a_3)x|(?P=a_6)x|(?P=a_18)),(?:(?P=a_3)|(?P=a_18)x),(?:(?P=a_6)|(?P=a_18)x),(?:(?P=a_8)x|(?P=a_18)x|(?P=a_19)x),(?:(?P=a_8)|(?P=a_18)|(?P=a_19)x),(?:(?P=a_8)|(?P=a_18)x|(?P=a_19)),(?:(?P=a_8)x|(?P=a_18)|(?P=a_19)),(?:(?P=a_20)x|(?P=a_19)x),(?:(?P=a_20)|(?P=a_19)),(?:(?P=a_20)),(?:(?P=a_21)x|(?P=a_7)x),(?:(?P=a_21)|(?P=a_7)),(?:(?P=a_21)|(?P=a_5)|(?P=a_22)x),(?:(?P=a_21)x|(?P=a_22)),(?:(?P=a_5)x|(?P=a_22)),(?:(?P=a_4)x|(?P=a_22)x|(?P=a_23)x),(?:(?P=a_4)|(?P=a_22)|(?P=a_23)x),(?:(?P=a_4)|(?P=a_22)x|(?P=a_23)),(?:(?P=a_4)x|(?P=a_22)|(?P=a_23)),(?:(?P=a_24)x|(?P=a_23)x),(?:(?P=a_24)|(?P=a_23)),(?:(?P=a_24)),(?:(?P=a_3)|(?P=a_7)|(?P=a_25)x),(?:(?P=a_3)x|(?P=a_25)),(?:(?P=a_7)x|(?P=a_25)),(?:(?P=a_6)x|(?P=a_25)x|(?P=a_26)),(?:(?P=a_6)|(?P=a_26)x),(?:(?P=a_25)|(?P=a_26)x),(?:(?P=a_6)x|(?P=a_26)x|(?P=a_27)x),(?:(?P=a_6)|(?P=a_26)|(?P=a_27)x),(?:(?P=a_6)|(?P=a_26)x|(?P=a_27)),(?:(?P=a_6)x|(?P=a_26)|(?P=a_27)),(?:(?P=a_28)x|(?P=a_27)x),(?:(?P=a_28)|(?P=a_27)),(?:(?P=a_28)),
SAT
-1 2 -3 4 5 -6 7 -8 9 -10 11 12 -13 14 15 -16 17 -18 -19 20 -21 22 -23 24 25 -26 -27 28
\end{lstlisting}

It took ~3 minutes on my old CPU clocked at ~2GHz.

The files, including the \emph{solver} in Python 3: \url{\RepoURL/\CURPATH/files/}.

Of course, all this stuff isn't practical at all.
But it demonstrates \emph{reduction} from one problem (regex matching with backreferences) to another (SAT).
Find a better algorithm for any of these problem and this
would lead to revolution in computer science.

A discussion at HN: \url{https://news.ycombinator.com/item?id=23597573}.

\myheading{Integer factorization using regex (with backreferences)}

\emph{gbacon} \href{https://news.ycombinator.com/item?id=23597573}{at HN} pointed to
\href{https://cstheory.stackexchange.com/questions/448/regular-expressions-arent}{a method of integer factorization using regex}.

(Unary encoding is "" for 0, "1" for 1, "11" for 2, "11111" for 5, etc.)

I simplified it a bit, because the first part of \verb!^1?$|^(11+?)\1+$! is just a check against an "1" string
(which is prime) and empty string (which is for 0) (\verb|^1?$|),
and I removed it for clarity:

\begin{lstlisting}[style=custompy]
#!/usr/bin/env python3

import re

#n=12300 # composite
#n=123001 # prime, 27s
#n=12300200 # composite
n=123023 # composite, one factor: 43
#n=123027 # composite, one factor: 3
#n=223099 # prime, 87s

regex=re.compile("^(11+?)\\1+$")
res=regex.match("1"*n)
if res==None:
    print ("prime")
else:
    x=len(res[1])
    y=n/x
    print ("composite: %d * %d = %d" % (x, y, n))
\end{lstlisting}

It can find factors for small numbers.
And here is how it works.
In plain English, we asking regex matcher to find such a string, that consists of some number ($\geq 2$)
of "1"'s (\verb|(11+?)|),
which is glueled with the same string (\verb|\1|) arbitrary number of times (\verb|+|).

Of course it's extremely slow, and even worse than bruteforce.
For ~87 seconds on my old 2GHz CPU I found out that 223099 is a prime.

But again, this is like a thought experiment.
A reduction from one problem (integer factorization) to another (find equal substrings in a string).

\myheading{Allan Wirth's version}

\begin{lstlisting}
From: Allan Wirth <allanlw(a)gmail.com>
Subject: PCRE JITing your regex sat solver

Hi Dennis,

I really appreciated your simple "regex" to 3-sat reduction.

I was curious to see how much perf could be gained from using the PCRE
JIT with your regex SAT solver, so I reimplemented it in PHP (I don't
know Perl and PHP has native PCRE support).

Source: https://gist.github.com/allanlw/69df509519335b88db886d48503a0f15

Time comparison below. It's about 50% constant speedup. Thought you
might be interested.

Cheers,
Allan
\end{lstlisting}

\myheading{One more usage of backreferences}

This time, I searched for good words that can serve as examples for
my example about Knuth-Morris-Pratt algorithm \ref{KMP}.
I wanted a list of words that have repeated prefixes and suffixes.

I took \href{https://github.com/dwyl/english-words}{a good collection of English words here}.

Then I used \verb|sed| to find words with repeated prefixes:

\begin{lstlisting}
sed -E -n '/^(.+)\1(.+)$/p' words_alpha.txt
\end{lstlisting}

Some of them:

\begin{lstlisting}
eel
oops
ooze
cocoa
cocos
kokos
mimic
cocoon
\end{lstlisting}

I couldn't manage \verb|sed| to find repeated suffixes, so I wrote a Racket program to do that (each suffix must have at least two characters):

\begin{lstlisting}
#lang racket

;(define r (pregexp "^(.+)\\1(.+)$")) ; two prefixes
(define r (pregexp "^(.+)(..+)\\2$")) ; two suffixes >=2

(define (f s)
  (regexp-match r s))

(define result
  (sort
   (filter f (file->lines "words_alpha.txt"))
   (lambda (x y)
     (< (string-length x) (string-length y)))))

(for ([i result])
  (displayln i))
\end{lstlisting}

Some of these:

\begin{lstlisting}
ceded
crisis
rococo
cantata
\end{lstlisting}

That sounds as a list of diseases:

\begin{lstlisting}
hydrofluosilicic
integropallialia
interjaculateded
panmyelophthisis
plasmaphoresisis
pneumonophthisis
antihemagglutinin
ophthalmophthisis
bacterioagglutinin
erythrocytoschisis
phytohemagglutinin
thoracoceloschisis
craniorhachischisis
phytohaemagglutinin
thoracogastroschisis
\end{lstlisting}

I couln'd stand the itch and tried to find all words with 3 repeated suffixes:

\begin{lstlisting}
(define r (pregexp "^(.+)(.+)\\2\\2$"))
\end{lstlisting}

That includes both words with 3 repeated characters at the end and the rare term 'ratatat' -- thricely repeated 'at' suffix:

\begin{lstlisting}
brrr
ieee
mmmm
oooo
viii
xiii
xviii
xxiii
ratatat
earlesss
\end{lstlisting}

('earlesss' seems to be a typo in the list of words I used.)

\levelup{}