\myheading{Fun with regular expressions} \label{RE_DFA} \renewcommand{\CURPATH}{regexp/fun} \leveldown{} Some years ago, when i struggled to understand regular expressions, I found these experiments useful. The following is like my student's notes. I hope, someone will find it useful too. TL;DR: Regular expressions usually used for finding patterns in a string/text and also simple parsing tasks. \myheading{colou?r} This is a classic example in many \ac{RE} tutorials -- a \ac{RE} that matches both U.S. "color" and British "colour". I'm going to use \href{https://github.com/katef/libfsm}{libfsm}, a tool that can produce a \ac{DFA} for a specific \ac{RE}. This is a \ac{DFA} for it: \begin{figure}[H] \centering \includesvg[width=1\textwidth]{\CURPATH/files/libfsm/color.svg} \end{figure} But we are programmers, not mathematicians. \textit{libfsm} can also produce a pure C function that will behave as this \ac{DFA}: \lstinputlisting[style=customc]{\CURPATH/files/libfsm/color.c} Now this is important part -- this function can be used \textit{in place} of \ac{RE} matcher. It will work exactly as it. In fact, it emulates \ac{DFA} for this \ac{RE}. It can be used instead of \ac{RE} matcher in practice, because it works way faster than any \ac{RE} library. Of course, it has an obvious limitation -- it's hardcoded/hardwired to a single \ac{RE}. It's very easy to implement \ac{DFA} in hardware: \begin{figure}[H] \centering \includegraphics[width=0.6\textwidth]{\CURPATH/04223.png} \caption{} \end{figure} (The picture has been shamelessly stolen from \href{https://instrumentationtools.com/topic/finite-state-machine/}{this page}.) You see, there is only \ac{ROM} (as a transition function) and (shift) register. Nothing else. (Please ignore 'outputs' at \ac{ROM}.) This can be implemented in \ac{FPGA} or \ac{ASIC} without any problem. In practice, it can filter a voluminous network traffic for a pattern or \ac{RAM}: without using a \ac{CPU} or \ac{GPU}. So if your \ac{FPGA}/\ac{ASIC} is clocked at 1GHz, it can process $10^9$ symbols per second, no matter how complex your \ac{RE} is. \myheading{(Net|Open|Free)BSD} ... to be matched with main BSD forks. \begin{figure}[H] \centering \includesvg[width=1\textwidth]{\CURPATH/files/libfsm/BSD.svg} \end{figure} \lstinputlisting[style=customc]{\CURPATH/files/libfsm/BSD.c} \myheading{(s|S)t(even|ephen|eve|evie)} ... to be matched with \textit{steven/stephen/steve/stevie}, but also the first letter can be captial. \begin{figure}[H] \centering \includesvg[width=1\textwidth]{\CURPATH/files/libfsm/steven.svg} \end{figure} This is important: there are two \textit{accepting states} (in double circles). How it can be handled in C? (States S6 and S8.) \lstinputlisting[style=customc]{\CURPATH/files/libfsm/steven.c} \myheading{(dark|light)?(red|green|blue)(ish)?} ... something more advanced. Here we see a limitation of \ac{DFA}. (Well, not a limitation.) This \ac{DFA} is correct. But it will be more (visually) easier to represent it in \ac{NFA}, but this is another story. \begin{figure}[H] \centering \includesvg[width=1\textwidth]{\CURPATH/files/libfsm/color2.svg} \end{figure} \ac{BTW}, \textit{libfsm} can produce a \ac{JSON} file with all transitions between all the vertices: \lstinputlisting{\CURPATH/files/libfsm/color2.json} And I wrote a simple utility to enumerate all possible input strings that this \ac{DFA} will accept, in Racket. It is trivial. It just traverses a \ac{DFA} recursively down to all accepting states. \lstinputlisting{\CURPATH/files/libfsm/enum_strings.rkt} What it will say for that \ac{DFA}? \begin{lstlisting} string matched: [red] string matched: [redish] string matched: [lightred] string matched: [lightredish] string matched: [lightgreen] string matched: [lightgreenish] string matched: [lightblue] string matched: [lightblueish] string matched: [green] string matched: [greenish] string matched: [darkred] string matched: [darkredish] string matched: [darkgreen] string matched: [darkgreenish] string matched: [darkblue] string matched: [darkblueish] string matched: [blue] string matched: [blueish] total: 18 \end{lstlisting} \myheading{(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)} In \ac{NFA}, we would just create 12 paths. Alas, this is \ac{DFA}, so these paths are overlapped at some vertices: \begin{figure}[H] \centering \includesvg[width=0.5\textwidth]{\CURPATH/files/libfsm/month.svg} \end{figure} \myheading{(1|2|3|4|5|6|7|8|9|10|11|12)} '1' is also \textit{accepting state}, but can also lead to '10', '11' and '12': \begin{figure}[H] \centering \includesvg[width=0.5\textwidth]{\CURPATH/files/libfsm/two_digit_number.svg} \end{figure} \myheading{(01|02|03|04|05|06|07|08|09|1|2|3|4|5|6|7|8|9|10|11|12)} \begin{figure}[H] \centering \includesvg[width=0.5\textwidth]{\CURPATH/files/libfsm/two_digit_number_zero.svg} \end{figure} \myheading{(01|02|03|04|05|06|07|08|09|10|11|12):[0-5][0-9]:[0-5][0-9]\textvisiblespace{}(A|P)M} (Both hours and minutes are always has two digits.) \begin{figure}[H] \centering \includesvg[width=0.7\textwidth]{\CURPATH/files/libfsm/AM_PM_time_2_digits.svg} \end{figure} \myheading{[0123][0-9]-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-(19|20)[0-9]\{2\}} For the sake of simplicity, only years 19xx and 20xx are accepted: \begin{figure}[H] \centering \includesvg[width=1\textwidth]{\CURPATH/files/libfsm/date0.svg} \end{figure} \myheading{[0123][0-9]-?(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-?(19|20)[0-9]\{2\}} But what if the dash is optional? \begin{figure}[H] \centering \includesvg[width=1\textwidth]{\CURPATH/files/libfsm/date1.svg} \end{figure} \myheading{([0123][0-9]-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-(19|20)[0-9]\{2\}|(19|20)[0-9]\{2\}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-[0123][0-9])} Can accept dates in DD-MM-YYYY or YYYY-MM-DD format, dash must be present: \begin{figure}[H] \centering \includesvg[width=1\textwidth]{\CURPATH/files/libfsm/date2.svg} \end{figure} \myheading{([0123][0-9]-?(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-?(19|20)[0-9]\{2\}|(19|20)[0-9]\{2\}-?(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-?[0123][0-9])} Can accept dates in DD-MM-YYYY or YYYY-MM-DD format, a dash is optional. Now the \ac{DFA} is too large: \begin{figure}[H] \centering \includesvg[width=0.8\textwidth]{\CURPATH/files/libfsm/date3.svg} \end{figure} Note: using the \textit{standard} \ac{RE}, it's possible to match only DD-MM-YYYY or DDMMYYYY strings without matching DD-MMYYYY or DDMM-YYYY, but the final \ac{DFA} would be bulky. \myheading{var1\textvisiblespace{}+var2} Like I stated before, \ac{RE} often used for parsing. Here are more practical examples. What it accepts? I'm going to try my utility again: \begin{lstlisting} string matched: [var1 var2] string matched: [var1 var2] string matched: [var1 var2] string matched: [var1 var2] string matched: [var1 var2] string matched: [var1 var2] total: 6 \end{lstlisting} Obviously, a set of input strings is infinite. So my utility has some limit and stopped after it is reached. \begin{figure}[H] \centering \includesvg[width=1\textwidth]{\CURPATH/files/libfsm/plus.svg} \end{figure} \myheading{var1\lbrack{}\textvisiblespace{}\textbackslash{}t\rbrack+var2} (Extending this \ac{RE} to whitespaces, i.e., tab symbol.) \begin{figure}[H] \centering % FIXME. can't include svg %\includesvg[width=1\textwidth]{\CURPATH/files/libfsm/plus_tab.svg} \includegraphics[width=1\textwidth]{\CURPATH/files/libfsm/plus_tab.png} \end{figure} \myheading{\textvisiblespace{}*var1\textvisiblespace{}*=\textvisiblespace{}*var2\textvisiblespace{}*} Even more practical. A whitespace can surround input string and the equality sign. Kleene star is used here. \begin{figure}[H] \centering \includesvg[width=1\textwidth]{\CURPATH/files/libfsm/kleene.svg} \end{figure} \myheading{(0|(1(01*0)*1))*} This \ac{RE} accepts input numbers (in binary form) that are divisible by 3 (\ref{DFA_prime}). \href{https://stackoverflow.com/questions/7974655/regex-for-binary-multiple-of-3}{(More about it.)} See also about DFA accepting a binary substring divisible by prime number: \ref{DFA_prime}. \begin{figure}[H] \centering \includesvg[width=0.5\textwidth]{\CURPATH/files/libfsm/divisible_by_3.svg} \end{figure} My utility correctly enumerates such string. But only up to some length... \begin{lstlisting} ... string matched: [000110] string matched: [00011011] string matched: [0001101111] string matched: [000110111111] string matched: [00011011111111] string matched: [0001101111110] string matched: [00011011111100] string matched: [00011011111001] string matched: [00011011110] string matched: [0001101111011] string matched: [00011011110110] string matched: [000110111100] string matched: [00011011110011] ... \end{lstlisting} Of course, the number of leading zeroes doesn't affect anything. \input{\CURPATH/config} \myheading{(\lbrack{}az\rbrack{}+\textvisiblespace{}*,)+\lbrack{}az\rbrack{}+\textvisiblespace{}*} Even harder \ac{RE}: for \ac{CSV} files. For the sake of simplicity, it only accepts strings consisting of 'a' or 'z'. \begin{figure}[H] \centering \includesvg[width=0.8\textwidth]{\CURPATH/files/libfsm/CSV.svg} \end{figure} It's like \textit{nested} or \textit{recursive} \ac{RE}. But probably too complicated -- a more advanced parser should be used for \ac{CSV} instead of \ac{RE}. \myheading{Syntactic sugar} All the rest is syntactic sugar: "X+" is for "XX*", "X{3}" is for "XXX", "X{1-2}" is for "X|XX". Syntactic sugar is important in practice, but it's better to focus on fundamentals when you start learning. \myheading{Homework} All this code is like "grep", it doesn't return \textit{groups} of parsed line, like MM, DD and YYYY separately. You can try hacking what libfsm produces in pure C to add this feature. Further reading: \href{https://research.swtch.com/}{blog posts about RE by Russ Cox} is what I found interesting. A discussion at \href{https://news.ycombinator.com/item?id=28243937}{HN} and \href{https://www.reddit.com/r/programming/comments/p7ulo0/fun_with_regular_expressions_part_i/}{reddit}. \levelup{}