# Creating Cartoon Voices with Math

Have you ever wanted to create a funny or entertaining voice like a cartoon character’s voice for a get-well video, a Valentine’s video, the narration for a DVD of home movies, an advertisement for your business, or some other application? This article explains how to create cartoon voices using mathematics to shift the pitch of normal voices. The article includes the Octave source code for an Octave function chipmunk that applies pitch shifting to audio.

The standard audio pitch shifting incorporated in many commonly used audio editors, such as the free open-source Audacity editor, is presented in detail. The article also shows the results of using a more sophisticated algorithm that produces a more natural sounding pitch-shifted voice similar to the voice of the famous cartoon character Mickey Mouse.

One of the basic concepts and methods of signal and speech processing is the Fourier transform, named after the French mathematician and physicist Joseph Fourier. The basic concept is that essentially any well-behaved real function [tex] f(x) [/tex] can be represented as the sum of the trigonometric sine and cosine functions. For example, a function [tex] f(x) [/tex] defined on the region [tex] (0, L) [/tex] can be expanded as the sum of sines and cosines:

[tex]\displaystyle f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty} \left[ a_n \cos\left(\frac{n\pi x}{L}\right) + b_n \sin\left(\frac{n\pi x}{L}\right) \right][/tex]

where the coefficients [tex] a_n [/tex] and [tex] b_n [/tex] are known as Fourier coefficients. This is the continuous form of the Fourier transform, usually called a Fourier series.

There is a discrete version of the Fourier transform, often used in digital signal processing:

[tex]\displaystyle a_s=\frac{1}{\sqrt{n}}\sum_{r=1}^n u_r e^{2\pi i(r-1)(s-1)/n}[/tex]

where [tex]r[/tex] is the index of an array of discrete values such as audio samples, [tex] u_r [/tex] is the value of the [tex]r[/tex]th audio sample, [tex]s[/tex] is the index of the discrete Fourier coefficients [tex] a_s [/tex] and [tex] n [/tex] is the number of discrete values, such as the number of audio samples in an audio “frame”. The index [tex] s [/tex] is essentially the frequency of the Fourier component. This version of the discrete Fourier transform uses the mathematical identity:

[tex]\displaystyle e^{ix} = \cos(x) + i \sin(x) [/tex]

where

[tex]\displaystyle i = \sqrt{-1} [/tex]

to combine the cosine and sine function components into complex functions and numbers.
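As a rough check of the discrete formula above, here is a minimal Octave sketch (not part of the chipmunk code) that evaluates the sum directly for a short test signal and compares it with Octave's built-in fft. The test signal and the variable names are illustrative assumptions. Octave's fft uses the opposite sign in the exponent and no 1/sqrt(n) factor, so the comparison takes the complex conjugate (valid for real input) and divides by sqrt(n).

```octave
% Direct evaluation of a_s = (1/sqrt(n)) * sum_r u_r * exp(2*pi*i*(r-1)*(s-1)/n)
n = 8;                          % number of samples in the toy "frame"
u = cos(2*pi*(0:n-1)/n);        % real test signal: one cosine cycle per frame
r = 1:n;                        % sample index, as in the formula

a = zeros(1, n);
for s = 1:n
  a(s) = sum(u .* exp(2*pi*1i*(r-1)*(s-1)/n)) / sqrt(n);
end

% Octave's fft uses exp(-2*pi*i*...) and no normalization.
% For real input the two conventions differ by a conjugate and a scale.
a_fft = conj(fft(u)) / sqrt(n);

max_diff = max(abs(a - a_fft)); % should be at rounding-error level
printf("largest difference from fft: %g\n", max_diff);
```

The direct sum costs on the order of n squared operations per frame, which is why practical code such as the listing in this article relies on the fast Fourier transform instead.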

In audio signal processing such as speech or music, the Fourier transform has a straightforward meaning. The sound is broken up into a combination of frequency components. In most instrumental music, this is very simple. The music is a set of notes or tones with specific frequencies. Percussion instruments and certain other instruments can produce more complex sounds with many frequency components. A spectrogram of a signal such as speech or music shows time on the horizontal axis and frequency on the vertical axis, with the strength of each frequency component shown by color or brightness. This is the spectrogram of a pure 100 Hertz (cycles per second) tone:

Spectrogram of 100 Hz Tone

The spectrogram is generated using the specgram function in the Octave signal processing package by dividing the signal into a series of overlapping audio frames. Overlapping audio frames are frequently used to achieve better time resolution during signal processing in the Fourier domain. Each audio frame is windowed using the Hanning window to reduce spectral leakage caused by the abrupt edges of the frame.

The Fourier transform is applied to each windowed audio frame, giving a series of frequency components, which are displayed on the vertical dimension of the spectrogram. Each frequency component is a bin in frequency covering a frequency range equal to the audio sample rate divided by the number of samples in the audio frame. This frequency bin size, or frequency resolution of the Fourier transform, is about 20 Hz in the spectrogram above (44100 samples per second / 2048 samples in an audio frame = 21.533 cycles per second). Because the 100 Hz tone in the example is not perfectly centered in the frequency bin spanning 100 Hz, the tone spreads out in the spectrogram, contributing to other bins as can be seen above. This is a limitation of the discrete Fourier transform which can lead to problems with signal processing such as pitch shifting.
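The bin arithmetic above can be reproduced with a few lines of Octave. This is a minimal sketch under stated assumptions: the Hann window is computed from its formula so that no package is needed (it may differ slightly from the window returned by the signal package), and the variable names are illustrative.

```octave
sampleRate = 44100;                     % samples per second
fftSize = 2048;                         % samples per audio frame
freq_resolution = sampleRate / fftSize; % width of one frequency bin in Hz
printf("bin width: %.3f Hz\n", freq_resolution);

% One frame of a pure 100 Hz tone, which falls between bin centers
n = (0:fftSize-1)';
tone = sin(2*pi*100*n/sampleRate);
window = 0.5 - 0.5*cos(2*pi*n/fftSize); % Hann window from its formula

mag = abs(fft(window .* tone));
mag = mag(1:fftSize/2);                 % keep the non-duplicated half

[~, idx] = max(mag);
peak_bin = idx - 1;                     % zero-based bin number of the peak
leak_ratio = mag(idx - 1) / mag(idx);   % strength of the bin just below the peak
printf("peak in bin %d (%.1f Hz), neighbor at %.0f%% of the peak\n", ...
       peak_bin, peak_bin*freq_resolution, 100*leak_ratio);
```

The tone sits about 4.6 bins above zero, so the strongest bin is the one centered near 107.7 Hz, and the bin below it carries a large share of the energy as well. That spreading is the effect the pitch shifter has to compensate for.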

Speech has a much more complex structure than a pure tone. In fact, the structure of speech remains poorly understood, which is why current (2011) speech recognition systems perform poorly in realistic field conditions compared to human beings. This spectrogram shows the structure of the introduction to United States President Barack Obama’s April 2, 2011 speech on the energy crisis: “Hello everybody. I’m speaking to you today from a UPS customer center in Landover, Maryland where I came to talk about an issue that affects families and businesses just like this one, the rising price of gas and what we can…”.

President Obama on the Rising Price of Gas

The spectrogram below shows the region from 0 to 600 cycles per second (Hertz). One can see a series of bands in the spectrogram. These bands are located at integer multiples (1, 2, 3, …) of the lowest frequency band, which is often referred to as F0 in the scholarly speech literature. The bands are known as the harmonics. F0 is known as the fundamental frequency. This is the frequency of vibration of the glottis, which provides the driving sound for speech and is located in the throat. The glottis vibrates at frequencies ranging from as low as 80 cycles per second (Hertz) in some men to as high as 400 cycles per second (Hertz) in some women and children. This fundamental frequency appears to be loosely correlated with the height of the speaker, higher for short speakers such as children and lower for taller women and men.
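The band structure can be imitated with a toy signal. The sketch below is only an illustration, not a model of the glottis: it assumes a fundamental of 125 Hz and four harmonics with arbitrarily chosen amplitudes, then confirms that the strongest spectral peaks fall at integer multiples of F0.

```octave
sampleRate = 8000;
N = sampleRate;                   % one second of audio, so bins are 1 Hz wide
t = (0:N-1)' / sampleRate;
F0 = 125;                         % assumed fundamental frequency in Hz

% Sum of the fundamental and its first few harmonics (amplitudes are arbitrary)
x = zeros(N, 1);
for h = 1:4
  x = x + (1/h) * cos(2*pi*h*F0*t);
end

mag = abs(fft(x));
mag = mag(1:N/2);                 % non-duplicated half of the spectrum

% Frequencies of the four strongest bins, in ascending order
[~, order] = sort(mag, 'descend');
peaks_hz = sort(order(1:4) - 1)' * (sampleRate / N);
ratios = peaks_hz / peaks_hz(1);  % multiples of the lowest band

disp(peaks_hz);
disp(ratios);
```

Real speech differs in that F0 drifts continuously, so the bands in the spectrograms bend and wobble over time instead of staying at fixed frequencies.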

The fundamental frequency F0 fluctuates in a rhythmic pattern that is not well understood as people speak. In some languages such as Mandarin Chinese, the changing pitch conveys meaning; a word with rising pitch has a different meaning from an otherwise identical word with falling pitch. In English, a rising pitch at the end of a phrase or sentence indicates that a question is being asked. “The chair.” is pronounced with falling pitch whereas “The chair?” is pronounced with a rising pitch at the end. It is difficult and sometimes even impossible to understand English if the rhythmic pattern of the fundamental frequency or pitch is abnormal.

President Obama on the Rising Price of Gas (to 600 CPS)

This spectrogram shows President Dwight David Eisenhower saying “in the councils of government we must guard against the acquisition of unwarranted influence, whether sought or unsought, by the military industrial complex” from his Farewell Address, January 17, 1961, probably his most famous phrase and his most famous speech today.

Eisenhower on the Military Industrial Complex

This spectrogram shows the range 0 to 600 Hertz (cycles per second). Again, one can easily see the repeating bands.

Eisenhower on the Military Industrial Complex (to 600 CPS)

Human beings perceive something which we call “pitch” in English, which appears closely related or identical to the center frequency of the F0 band in the spectrogram. The F0 band will be higher in higher pitched speakers such as many women and most children. President Obama and President Eisenhower have similar pitches, varying between 75 and 200 Hertz with an average of about 150 Hertz. Nonetheless, their voices sound very different. The F0 band can be as low as 70 or 80 Hertz (cycles per second) in a few speakers. Former California governor and actor Arnold Schwarzenegger used an extremely low pitched voice while playing the Terminator, his most famous role.

In general, low pitched voices tend to convey seriousness and sometimes menace while high pitched voices tend to convey less seriousness, although there are exceptions. The voice of the genocidal Daleks in the BBC’s Dr. Who series is both high pitched and menacing at the same time. Cartoon style voices can be created by shifting the pitch of normal speakers. This has been done for the Alvin and the Chipmunks characters created by Ross Bagdasarian Sr. It is likely that some kind of pitch shifting has been used over the years to create some of the voices of the Daleks on Dr. Who. Some robotic voices have probably been created by combining pitch shifting with other audio effects.

### Conventional Pitch Shifting

Pitch shifting predates the digital era. In the analog audio era, one could shift the pitch of a speaker by playing a record or tape faster or slower than normal. This shifts the pitch but also changes the tempo (the speed or rate of speaking) as well. One can achieve a pure pitch shift by, for example, recording a voice performer speaking at half normal speed and then playing the recording back at twice the normal rate. In this case, the pitch will be shifted up by a factor of two and the tempo or rate of speaking will be normal. One can create the Alvin and the Chipmunks high pitched voice in this way using analog tapes or records. One can also create lower pitched voices by appropriately combining the tempo of the original voice and the playback rate of the recording. Although these voices are easily understandable, they have artificial, electronic qualities not found in normal low or high pitched speakers or voice performers intentionally creating a low or high pitched voice. The voice of Walt Disney’s Mickey Mouse was performed by a series of voice artists starting with Walt Disney himself. This high pitched voice sounds much more natural than the Alvin and the Chipmunks voice.
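The tape-speed trick has a simple digital analogue. The following minimal sketch assumes a synthetic 100 Hz tone stands in for the recorded voice: keeping every second sample and playing the result at the original sample rate is equivalent to doubling the playback speed, so the pitch doubles and the duration halves.

```octave
sampleRate = 8000;
t = (0:sampleRate-1)' / sampleRate;    % one second of audio
voice = sin(2*pi*100*t);               % stand-in for a recorded voice

fast = voice(1:2:end);                 % "play back at twice the speed"

% Dominant frequency of the original
spec = abs(fft(voice));
[~, idx] = max(spec(1:floor(numel(voice)/2)));
f_orig = (idx - 1) * sampleRate / numel(voice);

% Dominant frequency of the sped-up version
spec = abs(fft(fast));
[~, idx] = max(spec(1:floor(numel(fast)/2)));
f_fast = (idx - 1) * sampleRate / numel(fast);

duration_orig = numel(voice) / sampleRate;
duration_fast = numel(fast) / sampleRate;
printf("pitch %g Hz -> %g Hz, duration %g s -> %g s\n", ...
       f_orig, f_fast, duration_orig, duration_fast);
```

Dropping samples this way is only safe here because the tone is far below the new Nyquist limit; real speech would need low-pass filtering first. The halved duration is why the performer has to speak at half speed for the final tempo to come out normal.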

In digital audio, it is possible to shift the pitch of the voice without changing the tempo of the speech. This can be done by manipulating the Fourier transform of the speech, the spectrogram, and converting back to the “time domain,” the actual audio samples. One can simply shift the Fourier components from their original frequency bin in the spectrogram to an appropriate higher or lower frequency bin. For example, if a Fourier component is in the 100 Hz bin, one shifts this Fourier component value to the 200 Hz bin to double the pitch. This must be done for every non-zero Fourier component. In general, this will produce a recognizable pitch shifted voice. If the Fourier components are not centered in each bin, which is usually the situation, this pitch shifted voice will have an annoying beat or modulation. It is necessary to perform some additional mathematical acrobatics to compensate for these effects to produce a relatively smooth pitch shifted voice similar to the output of the analog processing described above.
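The simple bin shift described above can be sketched for a single frame. This is an illustration under favorable assumptions, not the full algorithm: the sample rate and frame size are chosen so that a 96 Hz tone sits exactly at a bin center, which is the one case where the naive shift works cleanly.

```octave
sampleRate = 8192;
N = 1024;                            % frame size
freq_resolution = sampleRate / N;    % 8 Hz per bin
t = (0:N-1)' / sampleRate;
frame = cos(2*pi*96*t);              % 96 Hz is exactly the center of bin 12
pitchShift = 2.0;

X = fft(frame);
half = X(1:N/2);                     % non-duplicated coefficients

% Move each coefficient from bin (k-1) to bin (k-1)*pitchShift
shifted = zeros(N/2, 1);
for k = 1:N/2
  ind = round((k-1)*pitchShift) + 1;
  if ind <= N/2
    shifted(ind) = shifted(ind) + half(k);
  end
end

% Rebuild the full conjugate-symmetric spectrum and return to the time domain
Y = zeros(N, 1);
Y(1:N/2) = shifted;
Y(N/2+2:end) = conj(flipud(shifted(2:end)));
y = real(ifft(Y));

[~, idx_in] = max(abs(half));
Yout = fft(y);
[~, idx_out] = max(abs(Yout(1:N/2)));
f_in = (idx_in - 1) * freq_resolution;
f_out = (idx_out - 1) * freq_resolution;
printf("dominant frequency %g Hz -> %g Hz\n", f_in, f_out);
```

With a tone that falls between bin centers, the same code still moves the energy upward, but successive overlapping frames no longer line up in phase, which produces the beat the paragraph above describes.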

This video is President Obama’s original introduction from his April 2, 2011 speech on the energy crisis.

This video is President Obama speaking with his pitch doubled by shifting the Fourier components but without the mathematical acrobatics to compensate for un-centered frequency components:

This video is President Obama talking with a chipmunked voice; his pitch has been doubled.

This video is President Obama speaking with a deep voice; his pitch has been lowered to seventy percent of normal.

Octave is a free open-source numerical programming environment that is mostly compatible with MATLAB. The Octave source code below, the Octave function chipmunk, implements the standard pitch shifting algorithm in widespread use. The Octave code requires both Octave and the Octave Forge signal processing package for the specgram function, which computes the spectrogram of the signal.

The videos in this article were created by downloading the original MPEG-4 videos from the White House web site and splitting the audio and video into an MS WAVE file and a sequence of JPEG still images using the FFMPEG utility. Presidential speeches and video are in the public domain in the United States. The original still images were shrunk by half using the ImageMagick convert utility. The audio was pitch shifted in Octave using the chipmunk function below. The new audio and video were recombined into the MPEG-4 videos in this article by again using the FFMPEG utility. Variants of this pitch shifting algorithm can be found in many programs including the widely used free open-source Audacity audio editor (the Audacity pitch shifting algorithm may be slightly different from the algorithm implemented below):

```
function [ofilename, new_phase, output] = chipmunk(filename, pitchShift, fftSize, numberOverlaps, thresholdFactor)
% [ofilename, new_phase, output] = chipmunk(filename [,pitchShift , fftSize, numberOverlaps, thresholdFactor]);
%
% chipmunk audio effect (as in Alvin and the Chipmunks)
%
% ofilename -- name of output file with pitch shifted audio
% new_phase -- the recomputed phases for the pitch shift audio (for debugging)
% output -- the pitch shifted audio samples
%
% arguments:
%
% filename -- input file name (MS Wave audio file)
% pitchShift -- frequency/pitch shift (default=2.0)
% fftSize -- size of FFT (default = 2048)
% numberOverlaps -- number of overlaps (default = 4)
% thresholdFactor -- threshold factor for zeroing silence frames
%
% $Id: chipmunk.m 1.44 2011/08/04 01:25:35 default Exp default $
% (C) 2011 John F. McGowan, Ph.D.
% E-Mail: [email protected]
% Web: https://www.jmcgowan.com/
%

if nargin  1
raw_data = data(:,1); % input is stereo with 2 channels in 2 columns of array
else
raw_data = data; % mono sound input
end
data = [];
clear data; % free memory

mx_input = max(abs(raw_data(:)));

printf("applying fft\n");
fflush(stdout);
%spectrogram = fft(spectrogram);

overlap = fftSize - stepSize;
printf("stepSize: %d overlap is %d\n", stepSize, overlap);
fflush(stdout);

nsamples = length(raw_data);

% hanning window
window = hanning(fftSize); % window the output
window = (numel(window)/sum(window(:)) )*window; % normalize the window

% use Octave signal package specgram function to apply fft to windowed overlapping frames
% [] indicates default window (hanning)
%
[spectrogram, f, t] = specgram(raw_data, fftSize, sampleRate, window, overlap);

printf("spectrogram has dimensions %d %d\n", rows(spectrogram), columns(spectrogram));
fflush(stdout);

% free memory
raw_data = [];
clear raw_data;

intensity = dot(spectrogram, spectrogram, 1); % each column is an audio frame
max_intensity = max(intensity(:));
threshold = thresholdFactor*max_intensity;

speech_frames = intensity > threshold;

printf("speech_frames has dimensions: %d %d \n", rows(speech_frames), columns(speech_frames));
fflush(stdout);

printf("zeroing silence frames...\n");
fflush(stdout);

speech_frames = repmat(speech_frames,rows(spectrogram), 1);

spectrogram = spectrogram .* speech_frames;

printf("dimensions spectrogram are now: %d %d \n", rows(spectrogram), columns(spectrogram));
fflush(stdout);

printf("computing phase...\n");
fflush(stdout);

% spectrogram is half-array without duplicate fft coefficients
% 1:fftSize/2 rows, number time steps columns
% each row is an fft coefficient
%
magn = 2.*abs( spectrogram ); % magnitude of fft coefficients
phase = arg( spectrogram ); % phase of fft coefficients

previous_phase = zeros(size(phase));
previous_phase(:,2:end) = phase(:,1:end-1);

phaseShifts = (0:(fftSize/2)-1)*phaseShift; % expected phase shift if frequency component is centered in bin
phaseShifts = repmat(phaseShifts', 1, columns(phase));

spec_buf = phase - previous_phase; % change in phase from previous time step
spec_buf = spec_buf - phaseShifts; % difference between change in phase and expected phase change
% if frequency component is centered in frequency bin

fflush(stdout);
% handle mapping to -pi to pi range of atan2/arg (below)
phase_adjust = uint32(spec_buf./pi); % 0 if spec_buf between -pi and pi

spec_buf = numberOverlaps*spec_buf./(2*pi);

printf("computing corrected frequencies\n");
fflush(stdout);
% compute corrected frequency
frequencies = repmat(f',1,columns(spectrogram)); % f is row vector when returned by specgram

spec_buf = frequencies + spec_buf*freq_resolution;

corrected_freq = spec_buf;

printf("applying frequency shift\n");
fflush(stdout);

shifted_magn = zeros(size(magn));
shifted_freq = zeros(size(corrected_freq));

oldTime = time;
for k = 1:fftSize/2
ind = uint32((k-1)*pitchShift) + 1;
if (ind  1)
pct = (k / fftSize)*100.0; % percent progress
printf("frequency shift: processed %3.1f%% %d/%d\n", pct, k, fftSize);
fflush(stdout);
oldTime = time;
end % end if
end

%shifted_freq = corrected_freq * pitchShift;

% now convert from mag and freq to mag and phase
%
printf("computing new phase\n");
fflush(stdout);

spec_buf = zeros(size(spectrogram)); % make sure start with zeros

printf("new phase: assigning shifted frequencies\n");
fflush(stdout);

spec_buf(2:end,:) = shifted_freq(2:end,:);

printf("new phase: subtracting center frequencies\n");
fflush(stdout);

spec_buf(2:end,:) = spec_buf(2:end,:) - (frequencies(2:end,:) );

printf("new phase: dividing by frequency resolution\n");
fflush(stdout);

spec_buf(2:end,:) /= freq_resolution;

fflush(stdout);

spec_buf(2:end,:) = 2.*pi*spec_buf(2:end,:)/numberOverlaps;

printf("new phase: computing delta phase\n");
fflush(stdout);

delta_phase = spec_buf + phaseShifts;

%delta_phase = phaseShifts;

new_phase = delta_phase;

printf("new phase: adding delta phase\n");
fflush(stdout);

%new_phase = spec_buf;
new_phase = zeros(size(spec_buf));
% % %new_phase(:,1) = spec_buf(:,1);
% % %dc coefficient has no phase (always a non-negative real)
oldTime = time;
ncols = columns(spec_buf);
for i = 2:ncols
  % accumulate the phase of each frame from the previous frame
  new_phase(2:end,i) = new_phase(2:end,i-1) + delta_phase(2:end,i-1);
  newTime = time;
  deltaTime = newTime - oldTime;
  if (deltaTime > 1)
    pct = (i / ncols)*100.0; % percent progress
    printf("new phase: processed %3.1f%% %d/%d\n", pct, i, ncols);
    fflush(stdout);
    oldTime = time;
  end % end if
end

spec_buf = [];
clear spec_buf; % free memory

new_spectrogram = zeros(fftSize, columns(spectrogram)); % allocate full fft array for inverse fft

new_spectrogram(1,:) = shifted_magn(1,:); % dc coefficient
% use 1i for the imaginary unit because i is reused as a loop index above
new_spectrogram(2:fftSize/2,:) = shifted_magn(2:end,:).*cos(new_phase(2:end,:)) + 1i*shifted_magn(2:end,:).*sin(new_phase(2:end,:));

new_spectrogram(fftSize/2 + 2:end,:) = conj(flipud(new_spectrogram(2:fftSize/2,:))); % replicate fft coefficients

spectrogram = [];
clear spectrogram;

% INVERSE FFT
%
printf("applying inverse fft\n");
fflush(stdout);

new_data = real(ifft(new_spectrogram))/fftSize;

printf("dimensions new_data are %d %d\n", rows(new_data), columns(new_data));
fflush(stdout);

new_spectrogram = [];
clear new_spectrogram;

% each column is an audio frame which may overlap with previous audio frame by overlap samples
%

iframe = 1; % start at frame 1
it = 1; % start at first sample of output
output = zeros(nsamples,1); % all rows, 1 column

printf("applying overlap and add...\n");
fflush(stdout);

while( (it+fftSize-1)  1.0
scale_factor = mx / mx_input;
printf("scaling output by %f\n", 1.0/scale_factor);
fflush(stdout);
output = output / scale_factor;
end

printf("writing shifted audio to %s\n", ofilename);
fflush(stdout);
%
wavwrite(output, sampleRate, bits, ofilename);

disp('ALL DONE');
end % function
%
```
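The overlap-and-add step that the listing applies after the inverse FFT can be sketched in isolation. This is a minimal illustration under stated assumptions, not the author's loop: it reuses the names fftSize, numberOverlaps, stepSize, nsamples, it and output from the listing, computes a periodic Hann window from its formula, and leaves each frame unmodified so the reconstruction can be checked against the input.

```octave
fftSize = 64;
numberOverlaps = 4;
stepSize = fftSize / numberOverlaps;   % hop between successive frames
nsamples = 16 * fftSize;
x = sin(2*pi*5*(0:nsamples-1)' / nsamples);   % test signal

window = 0.5 - 0.5*cos(2*pi*(0:fftSize-1)' / fftSize);   % periodic Hann window

output = zeros(nsamples, 1);
it = 1;                                % first sample of the current frame
while (it + fftSize - 1) <= nsamples
  frame = window .* x(it:it+fftSize-1);
  % a pitch shifter would modify the spectrum of this frame here
  output(it:it+fftSize-1) = output(it:it+fftSize-1) + frame;
  it = it + stepSize;
end

% Overlapping Hann windows at this hop sum to numberOverlaps/2
output = output / (numberOverlaps / 2);

% Away from the two ends, every sample is covered by all overlapping frames
interior = fftSize:(nsamples - fftSize);
max_err = max(abs(output(interior) - x(interior)));
printf("largest reconstruction error: %g\n", max_err);
```

The first and last fftSize samples are covered by fewer frames and come out attenuated, which is one reason practical code often pads the signal or rescales the output afterwards.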

The screenshot below shows running the chipmunk function in Octave 3.2.4 on a PC under Windows XP Service Pack 2. This screenshot shows the function called from the Octave prompt using the default values of the function’s arguments. The argument numberOverlaps controls the mathematics to compensate for the uncentered frequency components. If numberOverlaps is one, there is no compensation. The larger numberOverlaps, the more effective the compensation. The more overlaps, the more computer time and resources required by the pitch shifting. A value of numberOverlaps of thirty-two (32) was used to pitch shift President Obama’s voice in the video above.

Running the Chipmunk Function in Octave

Although easily understandable, these pitch-shifted voices sound somewhat artificial. Indeed, this artificial quality is part of the appeal of the Alvin and the Chipmunks voice.
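The compensation for uncentered frequency components rests on one idea: the phase of a bin advances between two overlapping frames by an amount that reveals where the tone really sits. The sketch below is a simplified, single-bin illustration of that idea under stated assumptions (a pure 100 Hz tone, a Hann window computed from its formula, illustrative variable names); it is not the code path of the chipmunk function.

```octave
sampleRate = 44100;
fftSize = 2048;
numberOverlaps = 4;
stepSize = fftSize / numberOverlaps;      % samples between frame starts
freq_resolution = sampleRate / fftSize;   % about 21.5 Hz per bin
tone_hz = 100;                            % not centered in any bin

n = (0:fftSize+stepSize-1)';
x = cos(2*pi*tone_hz*n/sampleRate);
window = 0.5 - 0.5*cos(2*pi*(0:fftSize-1)'/fftSize);

% Two overlapping frames, one step apart
frame1 = fft(window .* x(1:fftSize));
frame2 = fft(window .* x(stepSize+1:stepSize+fftSize));

[~, idx] = max(abs(frame1(1:fftSize/2)));
k = idx - 1;                              % zero-based bin holding the peak

% Phase advance expected if the tone were exactly at the center of bin k
expected = 2*pi*k/numberOverlaps;

% Measured advance minus the expected one, wrapped into [-pi, pi)
delta = angle(frame2(idx)) - angle(frame1(idx)) - expected;
delta = mod(delta + pi, 2*pi) - pi;

% Convert the leftover phase into a fraction of a bin, then into Hertz
deviation_bins = numberOverlaps * delta / (2*pi);
estimated_hz = (k + deviation_bins) * freq_resolution;
printf("bin center %.1f Hz, corrected estimate %.2f Hz\n", ...
       k*freq_resolution, estimated_hz);
```

Because the leftover phase is only known modulo a full turn, the deviation can be resolved within plus or minus numberOverlaps/2 bins. With a single frame per step there is no usable phase history, which is consistent with the statement above that a value of one gives no compensation.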

### Pitch Shifting Gets Better

Pitch shifting algorithms have improved. It is now possible to produce voices that sound much more like natural voices at the desired new pitch, just like the voice of Mickey Mouse. This video is President Obama speaking with a voice similar to the voice of Mickey Mouse: