A preprocessing layer to convert raw audio signals to Mel spectrograms.
Inherits From: Layer, Operation
tf.keras.layers.MelSpectrogram(
fft_length=2048,
sequence_stride=512,
sequence_length=None,
window='hann',
sampling_rate=16000,
num_mel_bins=128,
min_freq=20.0,
max_freq=None,
power_to_db=True,
top_db=80.0,
mag_exp=2.0,
min_power=1e-10,
ref_power=1.0,
**kwargs
)
This layer takes a float32/float64 audio signal, single or batched, as input and computes its Mel spectrogram using the Short-Time Fourier Transform and Mel scaling. The input should be a 1D (unbatched) or 2D (batched) tensor representing audio signals. The output will be a 2D or 3D tensor representing Mel spectrograms.
A spectrogram is an image-like representation that shows the frequency spectrum of a signal over time. It uses the x-axis to represent time, the y-axis to represent frequency, and each pixel to represent intensity. Mel spectrograms are a special type of spectrogram that uses the mel scale, which approximates how humans perceive sound. They are commonly used in speech and music processing tasks such as speech recognition, speaker identification, and music genre classification.
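To make the time/frequency layout described above concrete, here is a minimal NumPy sketch of the Short-Time Fourier Transform step only. It is not the layer's implementation: the helper name `power_spectrogram` is hypothetical, and the sketch assumes no centering or padding, so its frame count will generally differ from what the layer returns.

```python
import numpy as np

def power_spectrogram(signal, fft_length=2048, stride=512):
    """Return |STFT|^2 with shape (fft_length // 2 + 1, frames)."""
    signal = np.asarray(signal, dtype=np.float64)
    # Assumption: no centering or padding; only complete frames are kept.
    num_frames = 1 + (len(signal) - fft_length) // stride
    window = np.hanning(fft_length)
    frames = np.stack([
        signal[i * stride : i * stride + fft_length] * window
        for i in range(num_frames)
    ])
    spectrum = np.fft.rfft(frames, n=fft_length, axis=-1)
    # Transpose so that frequency runs along axis 0 and time along axis 1.
    return (np.abs(spectrum) ** 2).T

sampling_rate = 16000
t = np.arange(sampling_rate) / sampling_rate
tone = np.sin(2 * np.pi * 1000.0 * t)  # one second of a 1 kHz sine

spec = power_spectrogram(tone)
peak_bin = int(spec[:, 0].argmax())       # strongest bin in the first frame
peak_hz = peak_bin * sampling_rate / 2048  # convert bin index to hertz
```

Each column of `spec` is the spectrum of one windowed frame, and for the pure tone the strongest row sits at the bin closest to 1 kHz. Mel scaling then regroups these linearly spaced rows into `num_mel_bins` bands.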
Examples:
Unbatched audio signal
layer = keras.layers.MelSpectrogram(num_mel_bins=64,
sampling_rate=8000,
sequence_stride=256,
fft_length=2048)
layer(keras.random.uniform(shape=(16000,))).shape
(64, 63)
Batched audio signal
layer = keras.layers.MelSpectrogram(num_mel_bins=80,
sampling_rate=8000,
sequence_stride=128,
fft_length=2048)
layer(keras.random.uniform(shape=(2, 16000))).shape
(2, 80, 125)
Input shape | |
---|---|
1D (unbatched) or 2D (batched) tensor with shape: (..., samples). |
Output shape | |
---|---|
2D (unbatched) or 3D (batched) tensor with shape: (..., num_mel_bins, time). |
Methods
from_config
@classmethod
from_config( config )
Creates a layer from its config.
This method is the reverse of get_config, capable of instantiating the same layer from the config dictionary. It does not handle layer connectivity (handled by Network), nor weights (handled by set_weights).
Args | |
---|---|
config | A Python dictionary, typically the output of get_config. |
Returns | |
---|---|
A layer instance. |
linear_to_mel_weight_matrix
linear_to_mel_weight_matrix(
num_mel_bins=20,
num_spectrogram_bins=129,
sampling_rate=8000,
lower_edge_hertz=125.0,
upper_edge_hertz=3800.0,
dtype='float32'
)
Returns a matrix to warp linear scale spectrograms to the mel scale.
Returns a weight matrix that can be used to re-weight a tensor containing num_spectrogram_bins linearly sampled frequency information from [0, sampling_rate / 2] into num_mel_bins frequency information from [lower_edge_hertz, upper_edge_hertz] on the mel scale.
This function follows the Hidden Markov Model Toolkit (HTK) convention, defining the mel scale in terms of a frequency in hertz according to the following formula:
mel(f) = 2595 * log10(1 + f / 700)
In the returned matrix, all the triangles (filterbanks) have a peak value of 1.0.
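The HTK formula quoted above, together with its algebraic inverse, can be sketched in plain Python. The function names `hz_to_mel` and `mel_to_hz` are illustrative and not part of the Keras API. Placing the band edges at equal steps on the mel axis is a common convention that this sketch assumes; the text here does not spell out how the layer spaces its bands.

```python
import math

def hz_to_mel(frequency_hz):
    """HTK mel scale: mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)

def mel_to_hz(mel):
    """Algebraic inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Assumed convention: num_mel_bins triangular bands share num_mel_bins + 2
# edges, placed at equal steps on the mel axis.
num_mel_bins = 20
lower_mel = hz_to_mel(125.0)
upper_mel = hz_to_mel(3800.0)
step = (upper_mel - lower_mel) / (num_mel_bins + 1)
edges_hz = [mel_to_hz(lower_mel + i * step) for i in range(num_mel_bins + 2)]

# Width of each gap between neighbouring edges, in hertz.
gaps_hz = [b - a for a, b in zip(edges_hz, edges_hz[1:])]
```

Because the mapping is logarithmic, equal steps in mel translate into gaps in hertz that widen as frequency increases, which is why the higher mel bands cover more linear spectrogram bins than the lower ones.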
For example, the returned matrix A can be used to right-multiply a spectrogram S of shape [frames, num_spectrogram_bins] of linear scale spectrum values (e.g. STFT magnitudes) to generate a "mel spectrogram" M of shape [frames, num_mel_bins].
# `S` has shape [frames, num_spectrogram_bins]
# `M` has shape [frames, num_mel_bins]
M = keras.ops.matmul(S, A)
The matrix can be used with keras.ops.tensordot to convert an arbitrary rank Tensor of linear-scale spectral bins into the mel scale.
# S has shape [..., num_spectrogram_bins].
# M has shape [..., num_mel_bins].
M = keras.ops.tensordot(S, A, 1)
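The shape behaviour of the two snippets above can be checked with a small NumPy stand-in. This sketch uses `np.tensordot` rather than `keras.ops.tensordot`, on the assumption that both contract the last axis of `S` with the first axis of `A` when the third argument is 1. The matrix `A` here is random placeholder data, not a real mel filterbank.

```python
import numpy as np

rng = np.random.default_rng(0)
num_spectrogram_bins, num_mel_bins = 129, 20

# Placeholder weights with the documented [num_spectrogram_bins, num_mel_bins] shape.
A = rng.random((num_spectrogram_bins, num_mel_bins))

S_2d = rng.random((50, num_spectrogram_bins))        # [frames, bins]
S_4d = rng.random((2, 3, 50, num_spectrogram_bins))  # [..., bins]

M_2d = np.tensordot(S_2d, A, 1)  # -> [frames, num_mel_bins]
M_4d = np.tensordot(S_4d, A, 1)  # -> [..., num_mel_bins]

# For a rank-2 input, tensordot with axes=1 is an ordinary matrix product.
same_as_matmul = np.allclose(M_2d, S_2d @ A)
```

Only the last axis changes size, so any number of leading batch or channel dimensions passes through untouched.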
Args | |
---|---|
num_mel_bins | Python int. How many bands in the resulting mel spectrum. |
num_spectrogram_bins | An integer Tensor. How many bins there are in the source spectrogram data, which is understood to be fft_size // 2 + 1, i.e. the spectrogram only contains the nonredundant FFT bins. |
sampling_rate | An integer or float Tensor. Samples per second of the input signal used to create the spectrogram. Used to figure out the frequencies corresponding to each spectrogram bin, which dictates how they are mapped into the mel scale. |
lower_edge_hertz | Python float. Lower bound on the frequencies to be included in the mel spectrum. This corresponds to the lower edge of the lowest triangular band. |
upper_edge_hertz | Python float. The desired top edge of the highest frequency band. |
dtype | The DType of the result matrix. Must be a floating point type. |
Returns | |
---|---|
A tensor of shape [num_spectrogram_bins, num_mel_bins]. |
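As a rough illustration of what such a matrix contains, the NumPy sketch below builds triangular filters that are linear in the mel domain, using the HTK formula and the default argument values listed above. The function `mel_filterbank_sketch` is hypothetical, and the layer's actual implementation may differ in details such as how the 0 Hz bin or the dtype is handled.

```python
import numpy as np

def hz_to_mel(frequency_hz):
    """HTK mel scale, vectorised over NumPy arrays."""
    f = np.asarray(frequency_hz, dtype=np.float64)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank_sketch(num_mel_bins=20, num_spectrogram_bins=129,
                          sampling_rate=8000, lower_edge_hertz=125.0,
                          upper_edge_hertz=3800.0):
    # Frequency of each linear bin, from 0 Hz up to the Nyquist frequency.
    bin_hz = np.linspace(0.0, sampling_rate / 2.0, num_spectrogram_bins)
    bin_mel = hz_to_mel(bin_hz)[:, np.newaxis]  # shape (bins, 1)

    # Assumption: band edges are evenly spaced on the mel axis.
    edges = np.linspace(hz_to_mel(lower_edge_hertz),
                        hz_to_mel(upper_edge_hertz), num_mel_bins + 2)
    lower, center, upper = edges[:-2], edges[1:-1], edges[2:]

    # Each triangle rises from its lower edge to 1.0 at its center,
    # then falls back to 0.0 at its upper edge.
    rising = (bin_mel - lower) / (center - lower)
    falling = (upper - bin_mel) / (upper - center)
    return np.maximum(0.0, np.minimum(rising, falling))

A = mel_filterbank_sketch()
```

Every weight in this sketch lies between 0.0 and 1.0. A column reaches exactly 1.0 only when a spectrogram bin falls precisely on that band's center frequency, so the sampled peaks are usually slightly below the ideal peak of each triangle.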
symbolic_call
symbolic_call(
*args, **kwargs
)