Acoustic Echo Cancellation

The Problem

Teleconferencing with a sound reinforcement system poses a difficult problem caused by coupling between loudspeakers and microphones located in the same room. Sound from the far side of the conference is delivered into the room through local loudspeakers and enters the local microphones along with the local participants' voices, causing an echo to be heard at the far side.


Reflections off surfaces in the room, some of them delayed by longer path lengths, also reach the microphones and mix with the direct sound from the loudspeakers and the local talkers' voices. If this echo-contaminated signal is sent to the far side, the far side participants will hear themselves as an echo along with the sound of the near side talkers.

The sound from the loudspeakers is attenuated by the distance between them and the microphones, and reflections in the room are partly absorbed by acoustical treatment in the building materials. This loss in level is called ERL (echo return loss).

Even with best practices in building construction materials and sound system design, a significant amount of far side sound will be picked up by the microphones. The further reduction of far side sound achieved by digital processing is called ERLE (echo return loss enhancement). The most common such process is AEC (acoustic echo cancellation), a DSP-based process used to remove as much of the sound from the local loudspeakers as possible from the signal that is sent to the far side. When the AEC has identified the echo and is removing it, it is said to have converged.

The total amount of echo suppression is the sum of ERL and ERLE. For example, if a room has a natural ERL of 15 dB and the ERLE (AEC cancellation) averages 25 dB, the total echo suppression is 40 dB. The amount of echo suppression varies while the system is operating: gain values in the local sound system are changed, and the echo return paths change when different microphones are used or moved. These changes make it more difficult for the AEC to converge and remain converged at a deep enough level to effectively remove audible echo heard at the far side of the conference.
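The arithmetic in the example above can be sketched in a few lines of Python. The function names are illustrative only and do not come from any ASPEN software. The sketch also converts the total to a linear amplitude ratio, which shows why the two figures add: attenuation ratios multiply, so their decibel values sum.

```python
def total_echo_suppression_db(erl_db, erle_db):
    """Total suppression in dB is the sum of ERL and ERLE (both in dB)."""
    return erl_db + erle_db


def db_to_amplitude_ratio(attenuation_db):
    """Convert an attenuation in dB to a linear amplitude ratio."""
    return 10 ** (-attenuation_db / 20.0)


# Values taken from the example in the text: ERL 15 dB, ERLE 25 dB.
total_db = total_echo_suppression_db(15.0, 25.0)
ratio = db_to_amplitude_ratio(total_db)

print(f"total suppression: {total_db} dB")
print(f"echo amplitude relative to the loudspeaker signal: {ratio}")
```

With 40 dB of total suppression, the echo returned to the far side is at one hundredth of its original amplitude.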

The AEC is fed the received far side audio as a reference signal so that this audio can be identified in, and removed from, the local signal that is to be sent to the far side. The difficulty is that during the coupling from the loudspeakers to the microphones, the signal is modified by reflections in the room and by non-signal noise. This modification is depicted as EPF (echo pass filter) in the diagram.

The echo-contaminated far side signal is mixed with the sound from the local talkers and becomes the input to the AEC. The job of the AEC is to construct a digital filter that can be applied to remove the far side signal (echo) before the signal is sent to the far side. This filter is depicted as ERF (echo reconstruction filter) in the diagram.


The “magic” in the process takes place in the Adaptation Processor, where an advanced DSP algorithm continuously monitors the effectiveness of the ERF and updates it as needed to remove as much of the echo as possible.
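The general idea of an adaptive echo reconstruction filter can be illustrated with a textbook NLMS (normalized least mean squares) filter. This is a swapped-in, well-known technique used here only as a sketch; it is not the ASPEN adaptation algorithm, which is proprietary and not described in this document. The echo path coefficients, tap count, step size, and all names below are assumptions chosen for the demonstration, and the simulated room contains no near side talker and no noise.

```python
import random


def nlms_echo_canceller(far, mic, taps=8, mu=0.5, eps=1e-8):
    """Textbook NLMS adaptive filter (an assumption, not the ASPEN algorithm).

    far: far side reference samples
    mic: microphone samples (echo plus any near side sound)
    Returns (residual signal to send to the far side, final filter weights).
    """
    w = [0.0] * taps   # current estimate of the echo reconstruction filter
    x = [0.0] * taps   # most recent reference samples, newest first
    residual = []
    for n in range(len(far)):
        x = [far[n]] + x[:-1]
        echo_estimate = sum(wi * xi for wi, xi in zip(w, x))
        e = mic[n] - echo_estimate              # what remains after cancellation
        norm = sum(xi * xi for xi in x) + eps   # reference energy (normalization)
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x)]
        residual.append(e)
    return residual, w


# Hypothetical room echo path, standing in for the EPF in the text.
random.seed(0)
echo_path = [0.5, -0.3, 0.2, 0.1]
far = [random.uniform(-1.0, 1.0) for _ in range(4000)]
mic = [
    sum(echo_path[k] * far[n - k] for k in range(len(echo_path)) if n - k >= 0)
    for n in range(len(far))
]

residual, weights = nlms_echo_canceller(far, mic)
print([round(wi, 3) for wi in weights])
```

In this idealized setting the filter weights settle on the simulated echo path and the residual becomes very small. A real room has a much longer echo path, background noise, and near side talkers, and a plain NLMS filter needs additional control logic to avoid diverging during double-talk.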


ASPEN conference processors employ a proprietary AEC (US Patent Pending) that is extremely fast converging, will not lose convergence during double-talk (both far and near sides equally active), and will continue to deepen the convergence with every tiny opportunity where the far side audio is dominant over the near side audio. The AEC is so robust, in fact, that it can handle any number of microphone input channels, all mixed with the patented, gain proportional auto mixing algorithm.*

This unique AEC makes an ASPEN system scalable so that any number of inputs can be added without having to purchase additional DSP processing power.

ASPEN AEC Performance

This illustration was created from an actual audio conference recording while the ERLE convergence depth was plotted along with the audio from both sides. The recording is 30 seconds in length and the illustration includes four different segments that demonstrate the effectiveness of the ASPEN AEC in a real world situation.
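How the convergence depth in the illustration was computed is not stated. One common way to produce such a curve, shown here as a sketch under that caveat, is to compare the signal power at the microphone with the power remaining after the AEC, frame by frame, and express the ratio in dB. The function names, frame length, power floor, and test signal are assumptions made for the example.

```python
import math


def erle_db(mic_frame, residual_frame, floor=1e-12):
    """ERLE for one frame: power before the AEC over power after it, in dB."""
    p_in = sum(s * s for s in mic_frame) / len(mic_frame)
    p_out = sum(s * s for s in residual_frame) / len(residual_frame)
    # The small floor avoids division by zero and log of zero on silent frames.
    return 10.0 * math.log10((p_in + floor) / (p_out + floor))


def erle_curve(mic, residual, frame_len):
    """ERLE per non-overlapping frame, suitable for plotting against time."""
    n_frames = min(len(mic), len(residual)) // frame_len
    return [
        erle_db(
            mic[i * frame_len:(i + 1) * frame_len],
            residual[i * frame_len:(i + 1) * frame_len],
        )
        for i in range(n_frames)
    ]


# Synthetic check: a residual at one tenth of the input amplitude is 20 dB down.
mic = [math.sin(0.1 * n) for n in range(800)]
residual = [0.1 * s for s in mic]
curve = erle_curve(mic, residual, frame_len=200)
print([round(v, 2) for v in curve])
```

A measurement of this kind reflects echo cancellation only while the far side is dominant. When near side talkers are active, their speech is present in both signals and the ratio no longer isolates the echo.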

[Illustration: ERLE composite]

[1] In the first segment, the far side signal is dominant and the AEC converges to an ERLE depth of 24 dB within 1.5 seconds. Then it picks up another 2 dB and maintains the convergence depth for another few seconds.

[2] At 10 seconds into the recording, a microphone is moved, which changes the path length between the loudspeaker and microphone. This requires that the AEC re-converge, which it does to a depth of a little over 20 dB, then maintains the convergence as the conversation moves to the near side being dominant.

[3] At just over 13 seconds into the conversation, the activity moves into what is called double-talk, where both near and far sides are talking at the same time and at similar levels. The AEC maintains the convergence depth during this period.

[4] At about 24 seconds into the recording, there is a brief pause at both sides, followed by the far side again becoming dominant. This allows the AEC to increase the convergence depth with brief peaks in the far side signal. This attribute of the AEC is evident at 26 seconds into the recording when there is a brief peak in the far side audio that coincides with an increase in the convergence depth.