How YAMNet Works: From Sound Waves to 1024-D Embeddings

DetectED
Apr 6

Teaching a Computer to Listen to Lungs

Doctors have used stethoscopes for over 200 years. But not every community has access to a pulmonologist. What if a computer could listen to lung sounds and identify diseases like asthma, COPD, or pneumonia?

That's what YAMNet helps us do.

What Is YAMNet?

YAMNet is a pre-trained deep neural network developed by Google. It was trained on millions of audio clips to recognize 521 different sound events, from speech and music to dog barks and, importantly for us, respiratory sounds such as coughing, breathing, and wheezing.

We use it as a feature extractor for lung sounds.

Why Not Train from Scratch?

Training a neural network to understand audio requires:

Hundreds of thousands of labeled audio samples
Weeks of training time on powerful GPUs
Extensive expertise in audio deep learning

YAMNet gives us a shortcut. It already knows how to extract meaningful features from sound, so we leave its weights untouched and train only a small classifier on top of it for our specific task (lung disease classification).

How YAMNet Processes Audio

Here's the pipeline step by step:

Step 1: Preprocessing

Raw audio is recorded at 16,000 samples per second (16 kHz). We record 3 seconds → 48,000 samples.

YAMNet expects mono audio at 16 kHz, with sample values normalized to the range [-1, 1].
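The preprocessing above can be sketched in a few lines of plain Python. This is a minimal illustration, not our production code: it assumes the recording is already mono 16-bit PCM at 16 kHz (no resampling is shown), and the choice to zero-pad short clips is an assumption made for the example. The helper names `pcm16_to_float` and `fit_length` are hypothetical.

```python
import math

TARGET_RATE = 16_000                      # samples per second
CLIP_SECONDS = 3
TARGET_LEN = TARGET_RATE * CLIP_SECONDS   # 48,000 samples


def pcm16_to_float(samples):
    """Scale 16-bit integer samples to floats in [-1.0, 1.0)."""
    return [s / 32768.0 for s in samples]


def fit_length(samples, target_len=TARGET_LEN):
    """Trim long clips and zero-pad short ones to exactly target_len samples.

    Zero-padding is an assumption for this sketch.
    """
    trimmed = samples[:target_len]
    return trimmed + [0.0] * (target_len - len(trimmed))


# Synthetic stand-in for a recording: 2.5 s of a 440 Hz tone as 16-bit PCM.
raw = [
    int(20_000 * math.sin(2 * math.pi * 440 * n / TARGET_RATE))
    for n in range(int(2.5 * TARGET_RATE))
]

waveform = fit_length(pcm16_to_float(raw))
print(len(waveform), min(waveform), max(waveform))
```

Dividing by 32768 keeps every 16-bit value inside [-1, 1], which is the range the model expects.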

Step 2: Mel Spectrogram

The raw waveform is converted into a mel spectrogram—a visual representation of sound.

X-axis = time (one frame every 10 ms, each computed over a 25 ms window; 96 consecutive frames form one 0.96-second patch)
Y-axis = frequency (64 mel bins, spaced to approximate human hearing)
Color = intensity (how loud at that frequency and time)

Wheezes appear as horizontal bands. Crackles appear as vertical spikes.
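To make the spectrogram geometry concrete, here is a rough sketch of the arithmetic behind it. The constants (25 ms window, 10 ms hop, 64 mel bands between 125 Hz and 7500 Hz, 96-frame patches with a 48-frame hop) are YAMNet's published settings as we understand them. The function names are our own, and the sketch deliberately ignores any padding the released model applies, so the real model may return one more patch than this counts.

```python
import math

SAMPLE_RATE = 16_000
WINDOW = round(0.025 * SAMPLE_RATE)   # 400 samples per analysis window
HOP = round(0.010 * SAMPLE_RATE)      # 160 samples between window starts
N_MELS = 64
MEL_MIN_HZ, MEL_MAX_HZ = 125.0, 7500.0
PATCH_FRAMES, PATCH_HOP = 96, 48      # 0.96 s patches, starting every 0.48 s


def hz_to_mel(hz):
    """Convert frequency in Hz to mels (natural-log form of the mel scale)."""
    return 1127.0 * math.log(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (math.exp(mel / 1127.0) - 1.0)


def mel_band_edges(n_mels=N_MELS):
    """Return n_mels + 2 edge frequencies, evenly spaced on the mel scale."""
    lo, hi = hz_to_mel(MEL_MIN_HZ), hz_to_mel(MEL_MAX_HZ)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_mels + 2)]


def count_frames(num_samples):
    """Number of full analysis windows that fit in the waveform."""
    return 1 + (num_samples - WINDOW) // HOP


def count_patches(num_frames):
    """Number of full 96-frame patches, without padding."""
    return 1 + (num_frames - PATCH_FRAMES) // PATCH_HOP


frames = count_frames(3 * SAMPLE_RATE)
edges = mel_band_edges()
print("frames:", frames, "patches:", count_patches(frames))
print("lowest band width (Hz):", round(edges[1] - edges[0], 1))
print("highest band width (Hz):", round(edges[-1] - edges[-2], 1))
```

The band widths printed at the end show why the mel scale suits breath sounds: bins are narrow at low frequencies, where much of the detail lives, and wide at high frequencies.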

Step 3: MobileNetV1 Backbone

YAMNet uses MobileNetV1, a lightweight convolutional neural network designed to run efficiently on mobile and embedded devices, which makes it a good fit for hardware like a Raspberry Pi.

The spectrogram passes through multiple convolutional layers that learn to detect:

Edge-like patterns (sudden changes in sound)
Textures (repetitive patterns like wheezes)
Higher-level features (combinations of patterns)
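Part of what makes MobileNetV1 lightweight is that it replaces ordinary convolutions with depthwise separable ones. The back-of-the-envelope comparison below shows the saving for a single layer. The layer size (3x3 kernel, 512 channels in and out) is chosen for illustration, and biases and batch-norm parameters are ignored.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in an ordinary convolution: every output sees every input."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel, c_in, c_out):
    """Weights in a depthwise separable convolution.

    Depthwise step: one kernel x kernel filter per input channel.
    Pointwise step: a 1x1 convolution that mixes the channels.
    """
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


kernel, c_in, c_out = 3, 512, 512   # illustrative layer size
standard = standard_conv_params(kernel, c_in, c_out)
separable = separable_conv_params(kernel, c_in, c_out)
print(standard, separable, round(standard / separable, 1))
```

For this layer the separable version needs roughly one ninth of the weights, which is where much of the speed on small devices comes from.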

Step 4: Embedding Extraction

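At the final layer before classification, YAMNet outputs a 1024-dimensional embedding vector for each time frame. Here a frame is one 0.96-second patch of the spectrogram, and a new patch starts every 0.48 seconds, so a 3-second recording yields only a handful of embeddings.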

Think of an embedding as a "fingerprint" of the sound. Similar sounds have similar embeddings.

We take the mean embedding across all time frames—this gives us a single 1024-D vector that represents the entire 3-second recording.
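Mean pooling is simple enough to show directly. The sketch below averages each dimension across frames. It uses a toy 3-frame, 4-dimensional input so the result is easy to check by hand; the real input would have 1024 dimensions per frame. The function name `mean_pool` is ours.

```python
from statistics import fmean


def mean_pool(frame_embeddings):
    """Average a list of per-frame embeddings into one clip-level vector."""
    if not frame_embeddings:
        raise ValueError("need at least one frame")
    dim = len(frame_embeddings[0])
    if any(len(frame) != dim for frame in frame_embeddings):
        raise ValueError("all frames must have the same dimensionality")
    # zip(*...) regroups the values by dimension instead of by frame.
    return [fmean(column) for column in zip(*frame_embeddings)]


# Toy example: 3 frames, 4 dimensions each (real embeddings are 1024-D).
toy_frames = [
    [0.0, 1.0, 2.0, 4.0],
    [2.0, 1.0, 0.0, 4.0],
    [4.0, 1.0, 4.0, 4.0],
]
clip_embedding = mean_pool(toy_frames)
print(clip_embedding)
```

Averaging discards the order of the frames, which is a reasonable trade-off for short clips where we care about what sounds are present rather than exactly when they occur.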

Why 1024 Dimensions?

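The number comes from the architecture: MobileNetV1's last convolutional layer has 1024 channels, and each channel is averaged down to one value. Loosely speaking, each dimension responds to a different aspect of the sound. Some combination of dimensions might respond to wheezing, another to crackles, another to breath intensity or rhythm.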

We don't need to know what each dimension means—the neural network learns the useful combinations automatically.

Step 5: Our Classifier

The 1024-D embedding is fed into our own small neural network:

Input (1024) → Dense(512, ReLU) → Dropout(0.4) → BatchNorm
           → Dense(256, ReLU) → Dropout(0.3) → BatchNorm
           → Dense(128, ReLU) → Dropout(0.2)
           → Dense(5, Softmax) → Output probabilities

The final layer outputs probabilities for five classes: asthma, COPD, pneumonia, healthy, and bronchial.
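Two quick calculations help make this classifier concrete: how many weights the dense layers hold, and how the softmax turns raw scores into class probabilities. This is a plain-Python sketch rather than our training code. The class order follows the sentence above and is an assumption, the logits are made up, and batch-norm parameters are left out of the count.

```python
import math

CLASSES = ["asthma", "COPD", "pneumonia", "healthy", "bronchial"]  # assumed order
LAYER_SIZES = [1024, 512, 256, 128, len(CLASSES)]


def dense_params(n_in, n_out):
    """Weights plus biases in one fully connected layer."""
    return n_in * n_out + n_out


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


total_dense = sum(
    dense_params(n_in, n_out)
    for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:])
)

logits = [0.5, 2.0, 0.1, -1.0, 0.3]          # made-up scores for one recording
probs = softmax(logits)
predicted = CLASSES[probs.index(max(probs))]

print("dense parameters:", total_dense)
print("predicted class:", predicted)
```

At well under a million parameters, the head is tiny compared with the networks usually trained from scratch on audio, which is why it can be trained on a modest dataset.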

Why This Works (Transfer Learning)

YAMNet was trained on a massive dataset (AudioSet) to predict 521 classes. Along the way it learned general sound features that transfer well to many other audio tasks.

By using YAMNet as a fixed feature extractor (we don't retrain it), we leverage all that learning for free. We only train the small classifier on top—which requires much less data and compute.
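The idea of a frozen extractor with a trainable head can be shown on toy data. In the sketch below, `frozen_extractor` is a hypothetical stand-in for YAMNet (it just computes two summary numbers), and the head is a single logistic unit rather than our real classifier. The point is only the division of labor: features are computed once and never change, and training touches the head alone.

```python
import math
import random


def frozen_extractor(clip):
    """Hypothetical stand-in for YAMNet: a fixed function, never updated."""
    return [sum(clip) / len(clip), max(clip) - min(clip)]


def predict(weights, bias, features):
    """Logistic head: probability that the clip belongs to class 1."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def train_head(features, labels, epochs=500, lr=0.5):
    """Train only the head with stochastic gradient descent on log loss."""
    weights, bias = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            error = predict(weights, bias, x) - y   # gradient w.r.t. z
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


rng = random.Random(0)
# Toy "recordings": class 1 clips have a wider amplitude range than class 0.
quiet = [[rng.uniform(-0.2, 0.2) for _ in range(50)] for _ in range(20)]
loud = [[rng.uniform(-1.0, 1.0) for _ in range(50)] for _ in range(20)]
clips, labels = quiet + loud, [0] * 20 + [1] * 20

# Features are computed once and cached; the extractor itself is not trained.
features = [frozen_extractor(clip) for clip in clips]
weights, bias = train_head(features, labels)

correct = sum(
    (predict(weights, bias, x) > 0.5) == bool(y)
    for x, y in zip(features, labels)
)
accuracy = correct / len(labels)
print("training accuracy:", accuracy)
```

Because the features are cached, each training epoch only multiplies a few small vectors, which is the same reason training our classifier on YAMNet embeddings is cheap.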

Our Results

With this approach, we achieved 86.4% accuracy on 5-class lung disease classification—comparable to an experienced pulmonologist's auscultation accuracy (85–90%).

Class	Precision	Recall	F1-Score
COPD	95%	93%	94%
Asthma	78%	91%	84%
Pneumonia	87%	79%	83%
Healthy	81%	81%	81%
Bronchial	89%	76%	82%
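The F1-score in the table is the harmonic mean of precision and recall. As a sanity check, the short sketch below recomputes it from the precision and recall columns. The macro average it prints is derived from the rounded table values, so treat it as approximate.

```python
# (precision %, recall %, reported F1 %) copied from the table above.
TABLE = {
    "COPD": (95, 93, 94),
    "Asthma": (78, 91, 84),
    "Pneumonia": (87, 79, 83),
    "Healthy": (81, 81, 81),
    "Bronchial": (89, 76, 82),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


recomputed = {
    name: round(f1_score(precision, recall))
    for name, (precision, recall, _) in TABLE.items()
}
macro_f1 = sum(f1_score(p, r) for p, r, _ in TABLE.values()) / len(TABLE)

for name, (_, _, reported) in TABLE.items():
    print(name, "reported:", reported, "recomputed:", recomputed[name])
print("approximate macro F1:", round(macro_f1, 1))
```

The harmonic mean punishes imbalance: a class with high precision but low recall, like Bronchial, ends up with an F1 closer to the weaker of the two numbers.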

The Clinical Connection

The model is particularly strong at detecting COPD (94% F1), a disease where early intervention matters. It is also reasonably good at detecting pneumonia (87% precision), though its 79% recall means roughly one in five pneumonia recordings is missed.

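The table shows lower precision for asthma (78%) and lower recall for pneumonia (79%), a pattern consistent with some pneumonia recordings being labeled as asthma. That confusion makes clinical sense: both conditions can present with wheezing. This isn't a bug; it reflects the reality that respiratory diseases have overlapping symptoms.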

What's Next?

Now that we have acoustic features (5-class probabilities) and microwave features (840 dimensions), we can fuse them together. The next post explains how XGBoost makes this fusion possible.