WAVEFORM MODELS FOR DATA-DRIVEN SPEECH SYNTHESIS

Michael W. Macon
Department of Electrical and Computer Engineering
Center for Spoken Language Understanding
Oregon Graduate Institute
Portland, OR USA
macon@ece.ogi.edu
cslu.cse.ogi.edu/tts

ABSTRACT

Many "data-driven" models for synthesizing speech rely on concatenating waveforms extracted from a database. However, the number of perceptually important degrees of freedom in speech makes it unlikely that enough data could ever be collected to cover all combinations of phonetic variables. By using models that can transform the waveform in perceptually relevant ways, the space of acoustic features covered by the data can be expanded. The minimal requirement for such a model is parametric control of fundamental frequency and duration. Beyond this, dimensions such as voice quality (breathiness, creak, etc.), phonetic reduction, and voice identity can be altered to expand the range of effects realizable from a given database. Several classes of models have been proposed to allow varying degrees of control over these dimensions, each involving tradeoffs among flexibility, fidelity, and computational cost. This paper describes common threads running through the best-known approaches, including the handling of non-periodic components and sensitivity to measurement errors. Although each of these approaches has its own merits, none has been shown to satisfy all of the desired properties for synthesis simultaneously. We demonstrate the limits of some of the simplifying assumptions underlying these methods and outline areas needing improvement.
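As a rough illustration of the "minimal requirement" stated above (parametric control of fundamental frequency and duration), the sketch below shows a crude overlap-add time-scale modification and a resample-based pitch shift in Python. This is not any of the models surveyed in the paper; the function names and parameters (ola_time_stretch, shift_pitch, frame_len, hop) are invented for illustration, and a practical system would use pitch-synchronous processing or a sinusoidal/harmonic model to reach acceptable quality.

```python
import numpy as np

def ola_time_stretch(x, rate, frame_len=1024, hop=256):
    """Naive overlap-add time-scale modification.

    rate > 1 shortens the signal, rate < 1 lengthens it.
    No phase alignment (no WSOLA-style search) is performed, so quality is
    rough; the point is only to illustrate parametric duration control.
    """
    window = np.hanning(frame_len)
    n_frames = max(1, int((len(x) - frame_len) / (hop * rate)))
    out_len = n_frames * hop + frame_len
    y = np.zeros(out_len)
    norm = np.zeros(out_len)
    for i in range(n_frames):
        analysis_pos = int(i * hop * rate)   # read position in the input
        synthesis_pos = i * hop              # write position in the output
        frame = x[analysis_pos:analysis_pos + frame_len]
        if len(frame) < frame_len:
            break
        y[synthesis_pos:synthesis_pos + frame_len] += frame * window
        norm[synthesis_pos:synthesis_pos + frame_len] += window
    norm[norm < 1e-8] = 1.0                  # avoid division by zero at edges
    return y / norm

def shift_pitch(x, semitones):
    """Pitch shift = time-stretch, then resample back to the original length.

    Resampling alone changes both pitch and duration; compensating with a
    time stretch leaves the duration approximately unchanged.
    """
    factor = 2.0 ** (semitones / 12.0)
    stretched = ola_time_stretch(x, rate=1.0 / factor)
    # Linear-interpolation resampling back to the original duration.
    idx = np.linspace(0, len(stretched) - 1, num=len(x))
    return np.interp(idx, np.arange(len(stretched)), stretched)
```

For example, shift_pitch(x, 3) raises the pitch by roughly three semitones at (approximately) constant duration, while ola_time_stretch(x, 0.8) lengthens the utterance by about 25%. The audible artifacts of such a simple scheme are exactly the kind of limitation that motivates the more careful models compared in the paper.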