Software Architecture

Anyone familiar with computing and telephony in 2025 may find the state of computing and telephony in 1985 and the following decade almost incomprehensible.  Telephone channels carried 64 Kbps digitized voice (or data encoded as analog signals); basic telephone “signalling” was ringing, a few audio tones for the human listener, and the phone being on-hook or off-hook.  Computers had very limited CPU “horsepower,” small memory, and limited disk space.  Digitizing and playing back speech were compute intensive, yet required only a fraction of the computing power demanded by automatic speech recognition (ASR) or text-to-speech synthesis (TTS); those required specialized computing resources.  Internet telephony and the World Wide Web were years away.

Despite the limitations of the time, Conversant had a remarkably forward-looking hardware and software architecture.  As recounted elsewhere, the hardware distributed functions to “intelligent” circuit packs dedicated to handling specific types of telephone connections (lines and trunks with firmware to handle the protocol required for the connection) and signal processing tasks such as speech recording, playback, ASR, and TTS.  Although many of the lowest level tasks could be successfully off-loaded to the circuit pack firmware, the overall control of the system relied on a single computing platform (CPU, RAM, and disk).  All of these were tied together with a computer bus (GPIB for the Conversant 1; ISA and PCI for later platforms) and a telephony “bus” or switching fabric that connected telephone voice paths between circuit cards under the control of the main CPU.

Unlike many computing tasks, where the application can be started, run through its logic and input/output operations, and terminated when its work is complete, handling telephone calls is somewhat different.  The “application” that handles a phone call must be started when a call arrives.  It can interact with the caller, performing input (getting touch-tone signals or spoken input) and output (playing prerecorded speech or generating TTS), but it is subject to the whims of the caller.  The caller may hang up.  The caller may not speak or press tones when requested.  The phone line may drop.  All of these possible events must be anticipated.  This means an IVR platform is an inherently event-driven system.

No IVR platform could be cost effective handling only a single phone call, or even a very limited number of calls, at a time.  All of the various “events” (calls arriving; calls leaving; speech being fetched from disk, converted to analog, and played to a caller; speech being received from the caller, converted to digital, and recorded to disk or fed to a speech recognizer; etc.) must be handled, and handled in real time.  Like a juggler attempting to keep five or seven balls aloft, each “event” must be handled when it needs service, or things fall apart.  The juggler’s balls come crashing down.  An IVR will drop calls, fail to respond to a caller, produce strangely “broken” speech, and so on.  In the worst case, the operating system will “crash” and the entire system stops working.

The challenge of a successful IVR is to orchestrate all of these activities, the timing of most not directly under the IVR’s control, while handling as many simultaneous calls as possible with the resources provided.  The Conversant software architecture found several elegant solutions to these inherent problems, solutions that would lead to the platform’s long life and commercial success (see Perdue & Rissanen, 1986, for more detail).

Virtual Machine Architecture

The first novelty was to conceive of the “voice system” as a “virtual voice computer.”  Although the idea of using one computer to emulate another (a “virtual machine,” if you will) was not new, software systems routinely relying on virtual machines were a few decades away.  If the underlying computing platform had been powerful enough, each channel might have run its own “virtual machine,” but that wasn’t possible.  Instead, a small number of long-lived Unix processes served as the voice computer emulator.  Thus, an interesting set of processes arose:  TRIP, SPIP, TWIP, VROP, TSM, etc.  Each process handled a specific part of the voice system emulation.  For example, the “IPs” were the “Input Process” handlers.  TRIP was responsible for interacting with the tip-ring cards (6-port analog “plain old telephone service,” or POTS, connections).  TWIP was the process that handled T1 cards.  These processes would interact with the firmware running on the relevant circuit cards and shuttle events and actions between the “application” running in the voice computer and the hardware supporting the telephone channel that application was serving.

Transaction State Machine

Application Scripts

The heart of this voice computer, effectively its CPU and operating system, was a process called TSM.  TSM stood for “transaction state machine.”  The “application” that would run within the voice computer when a call was active on a telephone channel was a “transaction script.”  Conceptually, voice applications have a relatively simple structure.  Although some programming logic may be needed to fetch and process resources (text to be converted to speech via TTS, pre-recorded speech files to stitch into a coherent spoken output, backend data to be fetched for auditory delivery, etc.), voice applications are fundamentally an interactive dialog between the caller and the voice computer.  These voice applications were conceived as a “script” — a call flow that indicated the desired dialog with appropriate branches and error legs.

The actions making up a script were the TSM instructions.  These “instructions” were effectively the machine code (instruction set) for the virtual voice computer.  This was a rather short list of actions such as ttalk(), getdig(), etc., and flow-control components such as jmp(), case(), rts(), etc.  The virtual voice CPU contained a number of registers.  Constants and references to registers were supported.  Because this was implemented at the level of machine code, it was called “transaction assembly” language, or TAS.  The syntax for TAS was vaguely similar to the C programming language.
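No actual TAS appears in the text, but the flavor of a transaction script can be suggested with a hypothetical sketch.  The instruction names (ttalk, getdig, jmp, case, rts) come from the list above; the script layout, argument conventions, and dispatcher loop are assumptions, written in Python rather than TAS so the illustration is runnable.

```python
# Hypothetical sketch of a TAS-style script and dispatcher.  Instruction
# names come from the text; everything else is an illustrative assumption.

script = [
    ("ttalk",  "welcome.msg"),            # 0: play a prerecorded prompt
    ("getdig", 1),                        # 1: collect one touch-tone digit
    ("case",   {"1": 4, "2": 6}),         # 2: branch on the digit collected
    ("jmp",    1),                        # 3: unrecognized digit, re-prompt
    ("ttalk",  "balance.msg"),            # 4: leg for digit 1
    ("rts",    None),                     # 5: end of transaction
    ("ttalk",  "goodbye.msg"),            # 6: leg for digit 2
    ("rts",    None),                     # 7: end of transaction
]

def run(script, caller_digits):
    """Step through the script, returning the prompts 'played'."""
    digits = iter(caller_digits)
    played, pc, reg = [], 0, None         # 'reg' stands in for a TAS register
    while pc < len(script):
        op, arg = script[pc]
        if op == "ttalk":
            played.append(arg)
            pc += 1
        elif op == "getdig":
            reg = next(digits, None)      # the caller's touch-tone input
            if reg is None:
                break                     # no input; real TAS would time out
            pc += 1
        elif op == "case":
            pc = arg.get(reg, pc + 1)     # branch table; else fall through
        elif op == "jmp":
            pc = arg
        elif op == "rts":
            break                         # transaction complete
    return played

print(run(script, ["9", "1"]))            # caller mis-keys, then presses 1
```

With a caller who first mis-keys and then presses 1, the sketch plays the welcome prompt, loops back through the error leg, and then plays the balance message.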


Asynchronous Distributed Processing

Computer applications execute in a stop-and-go fashion.  A CPU will typically execute a sequence of “machine instructions” at the CPU’s native speed.  Once the application needs to interact with the external environment (requesting an input or sending an output), it must wait on operations that are typically orders of magnitude slower than CPU instructions.  One of two things must happen.  Either the CPU repeatedly executes some non-instruction (a “no-op”) to kill time until the input/output completes, or the CPU suspends the current application and takes up some other application.  This form of application time sharing is an inherent function of operating systems and allows a computer to be effectively shared among several applications and/or several users.

State Machines

As the voice computer’s “operating system,” TSM implemented this form of time sharing between transaction scripts.  Once a voice application requested a caller’s input or set in motion the playing of recorded speech to the caller, TSM could suspend the execution of that application until the activity had completed.  To keep track of which voice scripts were executing and where each was in its execution, TSM treated each script as a “state machine.”  At any moment, each script existed in some state.  When an event reached TSM, TSM knew how to advance the corresponding script, executing its next set of TAS instructions and thus moving it to a new state.  The voice computer emulation could therefore handle multiple telephone channels (calls) simultaneously and perform the correct next step for each call’s dialog.
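This bookkeeping can be suggested with a small sketch.  The states, events, and transition table below are invented for illustration; the point is that a single control loop advances many per-channel state machines as events arrive, rather than dedicating a thread of control to each call.

```python
# Sketch of TSM-style bookkeeping: one loop advances many per-call
# state machines as events arrive.  The states, events, and transition
# table are illustrative assumptions, not the real TSM internals.

TRANSITIONS = {
    ("playing_prompt",  "play_done"): "awaiting_digits",
    ("awaiting_digits", "digits"):    "playing_result",
    ("playing_result",  "play_done"): "idle",
}

def advance(channel_states, channel, event):
    """Move one channel's script to its next state for this event."""
    if event == "hangup":                 # a hangup ends any dialog
        channel_states[channel] = "idle"
        return
    state = channel_states[channel]
    # An unexpected event leaves the channel's state unchanged.
    channel_states[channel] = TRANSITIONS.get((state, event), state)

# Three simultaneous calls, each at a different point in its dialog.
channels = {1: "playing_prompt", 2: "awaiting_digits", 3: "playing_prompt"}
for ch, ev in [(1, "play_done"), (2, "digits"), (3, "hangup"), (2, "play_done")]:
    advance(channels, ch, ev)
print(channels)
```

Because the state of every call lives in one table, a single process can service any channel whenever that channel’s next event arrives, in whatever order events happen to come.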

Time Handling

Physical computer platforms run based on a clock.  The sequence of clock pulses drives the CPU and keeps the computer executing.  An emulated voice computer also needs a “clock” that drives its functioning.  In the case of Conversant, the implementation of the voice computer’s clock was another clever choice, one that made Conversant quite resilient.

Event-driven applications such as TSM voice scripts have a potentially fatal weakness.  When an input/output operation is initiated and the application enters a “suspended” state waiting for the operation to complete, there is the possibility that it may never complete.  Leaving an application suspended indefinitely is a problem.  The “standard” approach to prevent this form of deadlock is to set a timer.  Every operation is then guaranteed either to return completed (successfully or unsuccessfully) or to have its timer expire.  One of these will occur, and the application can proceed, even if only to error handling or final cleanup.  This form of execution was inherent in TSM instructions.  Some instructions were “wait causing”: they would place the TSM script into a wait state and set a timer.  Others were simply executed, and the TSM script would proceed.

It’s important to realize that there were two levels of this asynchronous processing going on.  The actual computing platform that Conversant ran on had a CPU, a clock, interrupts and event processing, and a set of Unix operating system processes that needed to be kept running reliably.  The core Conversant Unix processes implemented the virtual voice computer with its own “CPU” (TSM’s execution function), clock (the voice system’s clocking mechanism), interrupts & events (described shortly), and the current set of executing TSM scripts.  These virtual applications needed to be kept running reliably to ensure that calls were answered and handled correctly.

The “events” within the voice computer were implemented as “messages.”  This is another standard mechanism in programming distributed, event-driven applications, but it was used to exceptionally good effect within Conversant.  Each of the core processes (TRIP, SPIP, VROP, etc.) interacted directly with the underlying hardware.  When activities or events at the hardware level needed to be communicated into a voice script, the handler process (for example, TRIP passing up some touch tones collected from a caller) would place a message in a message queue for the other processes that needed it.  The primary function of each of these processes was to attend to its input queues, recognize when a message arrived, process the message as quickly as possible, and return to waiting for the next message.

For example, TRIP might be getting various events from the tip-ring card firmware.  One channel might be sending up touch tones.  Another might have ringing occurring (a new call arriving).  Each of these events would cause TRIP to generate an appropriate message specific to the channel on which it occurred.  If digitized speech was ready to be delivered to one of TRIP’s channels, TRIP might receive a message from SPIP coordinating the delivery of binary speech data over the switching fabric to the appropriate TRIP channel.  TRIP would process that message from its queue.  With all of these core processes executing their functions by passing messages to other processes and consuming messages that required interaction with specific hardware, the voice computer could handle multiple telephone channels with a single control computer and a single set of Unix core processes.
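A minimal sketch of this queue discipline, with invented message shapes and event names: hardware-level events on many channels interleave freely, and the queue serializes them for a single consuming process.

```python
from collections import deque

# Sketch of the message-passing pattern described above.  Message
# shapes and event names are illustrative assumptions; the essential
# idea is that hardware events on many channels funnel through a queue
# to one consuming process.

tsm_queue = deque()                    # TSM's input queue
handled = []                           # what TSM has processed, in order

def trip_event(channel, event):
    """TRIP-like handler: turn a hardware event into a queued message."""
    tsm_queue.append({"channel": channel, "event": event})

def tsm_drain():
    """TSM's main-loop body: consume queued messages one at a time."""
    while tsm_queue:
        msg = tsm_queue.popleft()
        handled.append((msg["channel"], msg["event"]))

# Events arrive on different channels in arbitrary order.
trip_event(3, "ringing")               # a new call arriving on channel 3
trip_event(1, "touch_tone")            # a digit collected on channel 1
trip_event(3, "touch_tone")
tsm_drain()
print(handled)
```

Each message carries its channel number, so one queue and one consumer suffice no matter how many calls are active at once.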

Part of the complexity of creating a reliable product is ensuring that each of the core processes works correctly and that, when problems arise, corrective action is automatically taken.  If a core process dies, it must be immediately restarted and be able to pick up exactly where the previous process left off.  If the system is experiencing “strange” behavior, it needs to log the problems and also raise alarms so that troubleshooting and correction can be initiated by the people administering the system.  All of this was built into Conversant — that was part of the heritage of AT&T and Bell Labs’ approach to product development.

The interesting and slightly novel aspect of Conversant’s emulated voice computer was the way the voice system was “timed.”  Each TSM “wait causing” instruction would set a time limit for the instruction to complete.  However, this time limit wasn’t measured against the actual clock of the underlying computer but by “timing messages” placed into the various core process queues.  Since each core process’s main function was to consume its message queues, the core processes recognized the passage of time in the voice computer by encountering the next timing message.  If sufficient timing messages were encountered before an expected result was returned, the wait-causing instruction had “timed out.”  If a result showed up later, it was simply discarded; that event “took too long.”  This had a slightly counterintuitive, but highly beneficial, effect: some forms of delay were handled gracefully.  The voice system might get a bit behind, but it would subsequently “catch up.”
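A minimal sketch of the scheme, assuming invented message shapes and an arbitrary tick granularity: time advances only as the consuming process drains its queue, so a backlog slows the voice computer’s clock rather than breaking it.

```python
from collections import deque

# Sketch of the timing-message scheme: the passage of time is itself a
# queued message.  Tick granularity and message shapes are assumptions.

def await_result(queue, timeout_ticks):
    """Consume the queue until a result arrives or enough ticks pass."""
    ticks = 0
    while queue:
        msg = queue.popleft()
        if msg == "tick":
            ticks += 1
            if ticks >= timeout_ticks:
                return "timed out"     # a later result is simply discarded
        else:
            return msg                 # the result arrived in time
    return "timed out"                 # queue ran dry with no result

# Result arrives before the third tick: handled normally.
in_time = await_result(deque(["tick", "tick", "digit 5", "tick"]), 3)
# Three ticks pass first: the wait-causing instruction has timed out.
too_late = await_result(deque(["tick", "tick", "tick", "digit 5"]), 3)
print(in_time, "/", too_late)
```

Note that nothing here consults a real clock: a process that falls behind simply encounters its ticks later, which is exactly the graceful "catch up" behavior described above.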

At the level of the underlying physical computer, if the applications (processes) being executed overtax the processing resources, the CPU may get behind.  External events will occur regardless of the load on the CPU.  External events will generate interrupts.  Interrupt handlers will try to process them and post their results.  And the computer’s operating system will attempt to keep all of the processing “balls” in the air.  If it becomes too overloaded, some processing simply won’t happen fast enough.  Interrupts will be missed, or their results will land in the wrong places.  The operating system and the processes running on the CPU will get out of sync, and the system will likely crash.  Conversant made this catastrophic failure less likely by distributing the processing between the main computer (as the central controller) and the intelligent telephone peripheral cards.

Within the emulated voice computer, the decision to make all of the cooperating processes dependent on messages, including timing messages, meant that when the voice system became overloaded, the core processes simply got behind on their message queues.  If too many telephone channels simultaneously needed service, processing of the message queues would bog down and the “flow” of the voice applications would fall behind.  The result might be aberrant behavior from the caller’s viewpoint: no speech got played and the caller heard a long pause, a message was interrupted mid-play so that an unexpected silence distorted a spoken message, a back-end data access took an excessively long time, etc.  If the load condition was brief and transient, callers might simply experience an undesirable pause at some point.  If the load was longer and more severe, callers might begin to find the interaction unacceptable and hang up.  Although these were undesirable events, the voice computer didn’t “crash.”  It would bog down, but as conditions cleared, it could pick up and go on.

The inherent flexibility and stability of Conversant allowed it to handle many simultaneous channels.  As the underlying computer platform became more powerful, the platform progressed from the small number of channels supported by Conversant 1 to more than a hundred channels in later releases such as Conversant versions 6, 7, and 8.  However, there was always an interesting problem — telling an enterprise purchasing a Conversant, a priori, how many channels its Conversant could handle.  There was never a simple answer.  How many channels a given Conversant platform could reliably handle was always a function of the resources needed by the voice scripts that would be running, how many calls would be simultaneously active, and how quickly those calls came and went.