Modeling and Analysis of Communication Circuit Performance using Markov Chains and Efficient Graph Representations

Alper Demir, Peter Feldmann
{alpdemir, peterf}@research.bell-labs.com
Bell Laboratories, Murray Hill, New Jersey, USA

Abstract

In high-speed data networks, the bit-error-rate specification on the system can be very stringent, e.g., 10^-14. At such error rates, it is not feasible to evaluate the performance of a design using straightforward, simulation-based approaches. Nevertheless, performance prediction before actual hardware is built is essential for the design process.

This work introduces a stochastic model and an analysis-based, non-Monte-Carlo method for performance evaluation of digital data communication circuits. The analyzed circuit is modeled by a number of interacting finite state machines with inputs described as functions on a Markov chain state-space. The composition of these elements results in a typically very large Markov chain. System performance measures, such as probability of bit errors and rate of synchronization loss, can be evaluated by solving linear problems involving the large Markov chain's transition probability matrix.

This paper first describes a dedicated multi-grid method used to solve these very large linear problems. The principal bottleneck in such an approach is the size of the Markov chain state-space, which grows exponentially with system complexity. The second part of this paper introduces a novel, graph-based data structure capable of efficiently storing and manipulating transition probability matrices for Markov chains with several million states. The methods are illustrated on a real industrial clock-recovery circuit design.

1 Introduction

High-speed communication systems have extremely tight bit-error-rate (BER) specifications. For SONET/SDH applications it is not uncommon to have BER requirements on the order of 10^-14.
Such specifications are practically impossible to verify through straightforward simulation because of the extremely long sequence that would need to be simulated in order to get meaningful error statistics. In the absence of a performance analysis tool, designers rely on the experience of previous designs, intuition, and good luck. This environment discourages innovative solutions and non-incremental applications.

On the other hand, the design process of communication systems would benefit significantly from a reliable capability for evaluating design performance. Such a capability would permit the evaluation of a number of alternative algorithms, architectures, circuit techniques, and technologies in a short time and without the commitment of expensive resources.

A situation that illustrates the need for a reliable BER evaluation capability occurred in the design of a SONET-type application at a well-known micro-electronics company. The specification for a multiplexer chip required a BER of 10^-14. The prototype implementation, based on the modification of an existing design, fell short of the specification by more than an order of magnitude. The designers suspected that the main cause of the errors was interference noise in the PLL-based clock recovery circuit, induced by the rest of the chip's circuitry. A number of circuit, technology, and packaging remedies were proposed, but the designers were frustrated by their inability to predict the effectiveness of these remedies.

This paper introduces a method for performance evaluation of digital data communication circuit designs. Our analysis method computes the probability of errors directly from the design description, without relying on the simulation of long sequences. The system under evaluation is described as a number of finite-state machines (FSMs) with some of their inputs being random.
The random processes describe stochastic models for the incoming data, noise, and jitter. The random inputs are modeled as functions on the state-space of Markov chains. It is shown that under these circumstances the entire system can be modeled by a larger Markov chain. The quantities of interest for our system, such as the probability of a detection error or the mean time between failures due to detection errors, are thus available from standard Markov chain analysis.

The first challenge is to develop numerical methods capable of handling the extremely large transition probability matrices (TPMs) associated with these Markov chains, which can easily reach millions of states for moderately complex systems. In this work we employ a specialized multi-grid method which takes advantage of the underlying problem structure and is capable of solving million-state problems in less than an hour on a high-end workstation.

The remaining challenge is to store and perform computations with these extremely large TPMs. For this purpose we introduce a novel graph-based data structure called Conditionally Ordered Conditional Probability Decision Graphs (COCPDGs). COCPDGs are capable of storing and performing operations efficiently with TPMs resulting from chains with multi-million-state state spaces. In contrast to alternative data structures proposed in the past, COCPDGs are efficient for any practical interconnection structure of the model FSMs. COCPDG storage requirements typically grow sub-linearly with the size of the Markov chain state-space. The cost of computing the product of a COCPDG-encoded TPM with an arbitrary unstructured vector is linear in the size of the vector.
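To make the composition of an FSM with a Markov-chain input concrete, the following minimal Python sketch builds the joint chain for a toy system and evaluates a stationary probability by power iteration. Everything in it is an illustrative assumption rather than the authors' implementation: the names `compose` and `stationary`, the two-state input chain `P_IN`, the input function `f`, and the five-state saturating counter `delta` are all hypothetical, and the TPM is stored as a dense NumPy array instead of a COCPDG. The multi-grid solver is not reproduced either; plain power iteration is used in its place.

```python
import numpy as np

def compose(p_in, f, delta, n_fsm):
    """Joint TPM of a Markov-chain input driving a deterministic FSM.

    Joint state (m, s) is indexed as m * n_fsm + s, where m is the input
    chain state and s is the FSM state. The FSM sees the input f(m).
    """
    n_in = p_in.shape[0]
    tpm = np.zeros((n_in * n_fsm, n_in * n_fsm))
    for m in range(n_in):
        for s in range(n_fsm):
            s_next = delta(s, f(m))          # deterministic FSM update
            for m_next in range(n_in):       # random input-chain update
                tpm[m * n_fsm + s, m_next * n_fsm + s_next] += p_in[m, m_next]
    return tpm

def stationary(tpm, tol=1e-12, max_iter=100_000):
    """Stationary distribution by power iteration on row vectors."""
    pi = np.full(tpm.shape[0], 1.0 / tpm.shape[0])
    for _ in range(max_iter):
        nxt = pi @ tpm                       # one vector-matrix product
        if np.abs(nxt - pi).sum() < tol:
            return nxt
        pi = nxt
    raise RuntimeError("power iteration did not converge")

# Toy parameters (arbitrary, chosen only for illustration).
N_FSM = 5
P_IN = np.array([[0.9, 0.1],
                 [0.2, 0.8]])                # two-state input chain

def f(m):
    """Input seen by the FSM: +1 in chain state 0, -1 in chain state 1."""
    return 1 if m == 0 else -1

def delta(s, u):
    """Saturating up/down counter with states 0 .. N_FSM - 1."""
    return min(max(s + u, 0), N_FSM - 1)

TPM = compose(P_IN, f, delta, N_FSM)
PI = stationary(TPM)

# Hypothetical performance measure: probability that the counter sits at
# either saturation limit, summed over both input-chain states.
P_EDGE = PI.reshape(P_IN.shape[0], N_FSM)[:, [0, N_FSM - 1]].sum()
print(f"joint states: {TPM.shape[0]}, P(counter saturated) = {P_EDGE:.6f}")
```

The joint chain here has only 2 x 5 = 10 states, but the state count is the product of the component state counts, which is why dense storage stops being viable for realistic designs and a compact encoding of the TPM becomes necessary.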
Multiplication of the COCPDG-encoded TPM with structured and graph-encoded vectors can be performed in sub-linear time, but the use of structured vectors severely limits the choice of numerical method for eigenvalue computations and linear system solutions.

In this work we demonstrate the use of the COCPDG to encode a TPM resulting from the modeling of a real industrial clock and data recovery (CDR) circuit. For one example, the Markov chain has more than two million states and the TPM has 1.35 billion non-zeros. With our data structure we encode the matrix in about 160 MB and perform a matrix-vector multiply in approximately 20 seconds. The performance measures of the clock and data recovery circuit are computed through a simple power iteration in several hours.

2 Modeling and Performance Evaluation

Throughout the paper, we will be using the CDR circuit [1, 2] shown in Figure 1 to illustrate the stochastic model and the performance evaluation techniques. The framework we present here is by no means restricted to this particular circuit, and the general model we describe can be used for other discrete-time mixed-signal processing circuits.

The CDR circuit in Figure 1 consists of two coupled feedback loops. The first one (upper left) is a traditional "analog" charge-pump phase-locked loop (PLL) with a crystal reference and a voltage-controlled oscillator (VCO) that can generate multi-phase clocks (e.g., a ring oscillator). The second loop (lower right) is digital, and has the purpose of selecting "the best" of the clock phases generated by the first loop in order to retime/align the data. This phase selection is continually updated by the loop. The currently selected phase and the incoming data are "compared" in the phase detector (PD) which