Discrete Structures for Computing Notes 11

------------------------------------------------------------------------
Chapter 12: Modeling Computation
------------------------------------------------------------------------

Computation can be modeled with a variety of formalisms. So far in this
course, we have not gone into much detail: we have used pseudocode and
relied on our intuition for how programs work. But sometimes this
approach is not sufficiently rigorous. It is important to answer
questions such as:

* is it possible to solve a given problem with a computer (i.e., with
  an algorithm)?
* what is the complexity of solving a problem?

This chapter presents a few of the most common, and useful, formal
models of computing:

* phrase-structure grammars
* finite state machines
* Turing machines

A common theme in the study of models of computation is how POWERFUL a
particular model is. The power refers to the size of the class of
problems that can be solved in a given model. Ultimately, we would like
to have a model that is as powerful as the computers that we build and
use every day. But sometimes it is useful to have other, weaker, models
of computation.

The course CPSC 433 goes into all this in much more detail.

------------------------------------------------------------------------
12.1: Languages and Grammars
------------------------------------------------------------------------

DEF: A PHRASE-STRUCTURE GRAMMAR G = (V, T, S, P) has
* set V of SYMBOLS
* subset T of V called TERMINALS (the symbols of V that are not in T
  are the NONTERMINALS)
* distinguished nonterminal element S in V (called the START SYMBOL)
* set P of PRODUCTIONS (or RULES)

A production has the form alpha -> beta, where alpha and beta are both
strings of symbols, and alpha contains at least one nonterminal.

The idea is that we start with S, and then iteratively use the
productions to transform the current string of symbols into a new string
of symbols.
If we ever get to a string that has only terminals, then this procedure
(called a DERIVATION) terminates.

The set of all terminal strings that can be derived from a grammar G is
called the LANGUAGE of G, denoted L(G).

Different kinds of grammars put different restrictions on what the
productions can look like.

DEF: In a REGULAR grammar, all productions are of the form
* S -> empty string, or
* A -> a, or
* A -> aB,
where A and B are nonterminals and a is a terminal.

EX: Consider the grammar G = (V,T,S,P) where
* V = {S,A,0,1}
* T = {0,1}
* S is the start symbol
* P has the rules
    S -> 0S | 1A | 1 | empty-string
    A -> 1A | 1
  (shorthand: "X -> u | v" abbreviates the two productions X -> u and
  X -> v)

What kinds of terminal strings does this generate?

<<< do some derivations >>>

For example, S => 0S => 00S => 001A => 0011.

Every terminal string generated consists of some number of 0's followed
by some number of 1's. I.e., L(G) = {0^m 1^n : m and n are nonneg ints}.
This statement can be proved by induction on the length of the
derivation.

Applications of regular grammars include:
* algorithms to search text for certain patterns
* the part of a compiler that transforms an input stream (i.e., the
  characters of the input program) into a stream of tokens (i.e., groups
  characters together into entities that have more meaning, such as
  "variable" or "signed integer") for use by the next stage of the
  compiler (the parser)

Another notation for grammars is called BACKUS-NAUR FORM (BNF). It is
used for context-free grammars, which include the regular grammars.

EX: BNF for signed integers in decimal notation:

    <signed integer> ::= <sign><integer>
    <sign>           ::= + | -
    <integer>        ::= <digit> | <digit><integer>
    <digit>          ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

BNF is used extensively to describe the syntax of programming languages
(Java, LISP), database languages (SQL), markup languages (XML), etc.
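The derivation process for the example grammar above (S -> 0S | 1A | 1 |
empty-string, A -> 1A | 1) can be sketched in Python. This is only an
illustration, not part of the original notes: the names PRODUCTIONS and
derive are made up here, and the search always rewrites the leftmost
nonterminal, which is enough for this grammar because every string in a
derivation contains at most one nonterminal.

```python
from collections import deque

# Productions of the example grammar; "" stands for the empty string.
# (Hypothetical encoding chosen for this sketch.)
PRODUCTIONS = {
    "S": ["0S", "1A", "1", ""],
    "A": ["1A", "1"],
}
NONTERMINALS = set(PRODUCTIONS)


def derive(max_len):
    """Return every terminal string of length <= max_len derivable from S."""
    results = set()
    seen = {"S"}
    queue = deque(["S"])
    while queue:
        current = queue.popleft()
        # Position of the leftmost nonterminal, or None if all terminals.
        pos = next(
            (i for i, c in enumerate(current) if c in NONTERMINALS), None
        )
        if pos is None:
            results.add(current)  # derivation has terminated
            continue
        for rhs in PRODUCTIONS[current[pos]]:
            new = current[:pos] + rhs + current[pos + 1:]
            # No production removes a terminal, so strings that already
            # have too many terminals can be discarded.
            terminals = sum(1 for c in new if c not in NONTERMINALS)
            if terminals <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return results


if __name__ == "__main__":
    for word in sorted(derive(3), key=lambda w: (len(w), w)):
        print(repr(word))
```

Every string the sketch produces has the shape 0^m 1^n, which agrees with
the description of L(G) above for the lengths tried. It is a check on
small cases, not a substitute for the induction proof.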
------------------------------------------------------------------------
12.3: Finite-State Machines with No Output
------------------------------------------------------------------------

Finite state machines (aka finite automata) are used to model processes
in which there is a fixed number of different states, and we can
transition from one state to another based on the input characters that
are fed to the process. Certain states are designated as "accepting"
states: if the machine ends in an accepting state once it has consumed
its entire input, then we say the input string has been accepted.

Finite state machines are the basis of algorithms for
* spell checking
* grammar checking (in compilers)
* indexing or searching texts
* speech recognition
* formatting text with markup languages such as HTML and XML
* network communication protocols

DEF: A finite automaton M = (S,I,f,s_0,F) consists of
* finite set of states S
* finite input alphabet I
* transition function f : S x I -> S (takes current state and current
  input and produces next state)
* start state s_0 (drawn from S)
* subset F of S (the final, or accepting, states)

The easiest way to describe a finite automaton is with a state
transition diagram: a directed graph in which each vertex is a state and
there is an edge from vertex a to vertex b labeled with x if f(a,x) = b,
i.e., there is a transition from state a to state b when the current
input character is x.

<<< Fig 1, p. 806 >>>

DEF: A string x is ACCEPTED by M if, starting in the start state s_0,
the computation of M on x ends in an accepting state. The set of all
strings accepted by M is the LANGUAGE of M, L(M).

EX: Describe the languages accepted by these finite automata:

<<< Fig 2, p.
807 >>>

EX: Construct finite automata for these languages:
(1) set of bit strings that begin with two 0's
(2) set of bit strings that contain two consecutive 0's
(3) set of bit strings that do not contain two consecutive 0's
(4) set of bit strings that end with two 0's
(5) set of bit strings that contain at least two 0's

Construct a finite automaton for the set of all strings over {a,b} that
contain abba.

<<< uses "backtracking" transitions >>>

The type of finite state machine defined so far is DETERMINISTIC: given
the current state and the next input symbol, there is only one possible
next state to go to, according to the transition function.

We can also consider a variation on the finite state machine called
NONDETERMINISTIC: The transition function provides a SET of possible
next states, for a given current state and input symbol.

The notion of a string being accepted is **different** for a
nondeterministic finite state machine: a string is accepted if there
exists a computation that ends in an accepting state. Even if there are
other computations that don't accept, as long as there is at least one
that does accept, the string is considered to be accepted.

EX: <<< Fig 6, p. 812 >>>

--------

It may appear that nondeterministic FAs can accept more languages than
can deterministic FAs (for instance, there is a looser notion of string
acceptance). However, it turns out that this is not the case:

THEOREM: For every nondeterministic FA, there is an "equivalent"
deterministic FA (accepts the same language).

PROOF IDEA: Given an arbitrary nondeterministic FA M, we construct a
deterministic FA M' that accepts the same language. The state set for M'
is the powerset of the state set of M: that is, there is a single state
of M' for every subset of states of M. Any state of M' that contains an
accepting state of M is considered to be accepting for M'.
Transitions are constructed among the states of M' in such a way as to
ensure that every accepting computation of M is mimicked by an accepting
computation of M'. QED

Note that the equivalent deterministic machine may have many more states
than the original nondeterministic machine.

A virtue of nondeterministic machines is that they are often simpler to
come up with. (Cf. the "abba" language from before, no need for the
"backtracking" transitions.) Then if you need to be deterministic, you
can run the conversion algorithm (more mechanical).

------------------------------------------------------------------------
12.4: Language Recognition
------------------------------------------------------------------------

It turns out that there is an interesting connection between the set of
languages that are generated by regular grammars and the set of
languages accepted by finite state machines. Namely, they are the same.

Furthermore, the set of languages represented by REGULAR EXPRESSIONS is
also the same.

Regular expressions are a notation for specifying sets of strings that
can be created by using concatenation, union, and a special "closure"
operation, called Kleene closure, starting with certain base objects.

DEF: KLEENE CLOSURE of a set A, denoted A*, is U_{k=0}^infty A^k.
I.e., it is the set consisting of concatenations of arbitrarily many
strings from A.

EX: Suppose A = {0,1}.
  A^0 = {lambda} (set containing a single string, the empty string)
  A^1 = A = {0,1}
  A^2 = AA, the concatenation of A and A, which is {00,01,10,11}
  A^3 = {000,001,010,011,100,101,110,111}
  ...
So A* is the set of all binary strings of any length.

EX: Suppose A = {ab,bc}
  A^0 = {lambda}
  A^1 = A = {ab,bc}
  A^2 = AA = {abab, abbc, bcab, bcbc}
  A^3 = AAA = {ababab, ababbc, abbcab, abbcbc,...}
  ...

Regular expressions are formally defined recursively:

DEF: Let I be a set.
Basis:
* emptyset-bold is a regular expression, denoting the empty set
* lambda-bold is a regular expression, denoting the set {lambda}
  (lambda is the empty string)
* x-bold is a regular expression for each x in I, denoting the set {x}
Recursive: Suppose A and B are regular expressions. Then so are
* AB, denoting the concatenation of the sets represented by A and B
* A U B, denoting the union of the sets represented by A and B
* A*, denoting the Kleene closure of the set represented by A

DEF: Sets represented by regular expressions are called REGULAR SETS.

EX:
* 10* is the set of all strings that start with a 1 which is followed by
  zero or more 0's
* (10)* is the set of all strings consisting of zero or more copies of
  "10"
* 0 U 01 is the set consisting of the string 0 and the string 01
* (0*1)* is the set consisting of all binary strings that do not end
  with 0 (this includes the empty string)

EX: Find regular expressions for:
* set of bit strings with even length
    ((1 U 0)(1 U 0))* or (00 U 01 U 10 U 11)*
* set of bit strings that end with 0 and do not contain 11
    (0 U 10)* (0 U 10)
  (second term is there to exclude the empty string)
* set of bit strings that contain an odd number of 0's
    1*(01*01*)*01*

KLEENE'S THEOREM: A set is regular (represented by a regular expression)
iff it is recognized by a finite state machine.

PROOF SKETCH:
(1) Show that any regular set is accepted by a finite state machine.
Strategy: By definition, every regular set is represented by a regular
expression. Show how to convert any regular expression into an
"equivalent" finite state machine. The construction is recursive, to
match the recursive definition of regular expressions: build a machine
for each basis expression, then show how to combine machines for
concatenation, union, and Kleene closure.

(2) Show that the language accepted by any finite state machine is a
regular set. Strategy: Look at the state transition diagram of the
finite state machine and go through a process of "condensing" it and
changing the labels on the edges until getting something small enough
that the equivalent regular expression can be extracted from the edge
labels. QED
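Kleene's theorem can be spot-checked on one of the examples above. The
Python sketch below is an illustration only, not part of the original
notes. It encodes a two-state deterministic machine M = (S,I,f,s_0,F) for
"bit strings that contain an odd number of 0's" and compares it with the
regular expression 1*(01*01*)*01* on all short bit strings. The state
names "even" and "odd" and all function names are assumptions made for
this sketch. Python's re module has no U operator (it writes union as |),
but this particular expression needs no union.

```python
import re
from itertools import product

# Hypothetical encoding of M = (S, I, f, s_0, F): the state records
# whether the number of 0's read so far is even or odd.
STATES = {"even", "odd"}
ALPHABET = {"0", "1"}
TRANSITION = {
    ("even", "0"): "odd",
    ("even", "1"): "even",
    ("odd", "0"): "even",
    ("odd", "1"): "odd",
}
START = "even"
ACCEPTING = {"odd"}

# The regular expression from the example above.
ODD_ZEROS = re.compile(r"1*(01*01*)*01*")


def dfa_accepts(word):
    """Run M on word and report whether it ends in an accepting state."""
    state = START
    for symbol in word:
        state = TRANSITION[(state, symbol)]
    return state in ACCEPTING


def regex_accepts(word):
    """Report whether the whole word matches the regular expression."""
    return ODD_ZEROS.fullmatch(word) is not None


def disagreements(max_len):
    """List the bit strings of length <= max_len on which the two differ."""
    bad = []
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            word = "".join(bits)
            if dfa_accepts(word) != regex_accepts(word):
                bad.append(word)
    return bad


if __name__ == "__main__":
    print(disagreements(10))
```

An empty list means the machine and the expression agree on every bit
string up to the chosen length. That is evidence for this one example; it
does not prove the theorem.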