Building a Comprehensive Skin Disease Dataset

Anie Etor-Udofia
Mar 29

Introduction

Every AI model is only as good as its training data. For NOMA AI, creating a comprehensive and balanced dataset was the first critical challenge.

The goal was to include images spanning the full spectrum of skin conditions—from common acne to malignant melanoma—while ensuring the model would not become biased toward overrepresented classes.

Dataset Compilation

The “Skin Diseases and Cancer Comprehensive Dataset” was compiled on Kaggle from multiple public dermatology repositories.

The final dataset contains roughly 12,900 images (12,888 according to the split sizes reported below) across 24 distinct classes, grouped as follows:

Malignant Classes (4)

Melanoma
Basal Cell Carcinoma
Squamous Cell Carcinoma
Actinic Keratosis (precancerous)

Benign Classes (19)

Acne, Eczema, Psoriasis, Rosacea, Moles
Seborrheic Keratoses, Lichen, Lupus
Vitiligo, Warts, Tinea, Candidiasis
Bullous, Vasculitis, Vascular Tumors
Drug Eruption, Infestations/Bites
Sun/Sunlight Damage, Benign Tumors

Normal Class (1)

Healthy skin tissue

The Class Imbalance Challenge

A major obstacle emerged: significant variation in class sizes.

Some benign classes (e.g., Eczema) had more than 1,000 images
Some malignant classes (e.g., Squamous Cell Carcinoma) had only about 200 images

Without correction, the model would be biased toward the large benign classes and would predict “benign” in most cases, an unacceptable risk for a cancer screening tool.

The Solution: Computed Class Weights

To address the imbalance, scikit-learn’s compute_class_weight function was used to assign each class a weight inversely proportional to its frequency:

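import os

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# CLASS_NAMES (list of class folder names) and dataset_path (dataset root)
# are assumed to be defined earlier in the project.
IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png')

# Count the image files in each class folder.
class_counts = []
for class_name in CLASS_NAMES:
    class_path = os.path.join(dataset_path, class_name)
    num_images = len([
        f for f in os.listdir(class_path)
        if f.lower().endswith(IMAGE_EXTENSIONS)  # also matches .JPG, .PNG, etc.
    ])
    class_counts.append(num_images)

# Build one integer label per image so class frequencies can be computed.
y_train = []
for i, count in enumerate(class_counts):
    y_train.extend([i] * count)

# 'balanced' sets each weight to n_samples / (n_classes * class_count).
class_weights = compute_class_weight(
    'balanced',
    classes=np.arange(len(CLASS_NAMES)),
    y=np.array(y_train)
)

# Map class index -> weight.
class_weight_dict = {
    i: weight for i, weight in enumerate(class_weights)
}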

When these weights are applied to the training loss, mistakes on underrepresented classes, including the smaller malignant ones, count more heavily than mistakes on large classes.
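To make the weighting concrete, here is a minimal sketch of the "balanced" heuristic that scikit-learn documents, n_samples / (n_classes * class_count), written in plain Python. The function name balanced_class_weights and the three-class toy counts are assumptions for illustration only. The Eczema and Squamous Cell Carcinoma figures echo the approximate sizes mentioned earlier, and the Melanoma count is made up.

```python
def balanced_class_weights(class_counts):
    """Return {class_name: weight} using n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        name: n_samples / (n_classes * count)
        for name, count in class_counts.items()
    }


# Toy counts for illustration; these are not the real dataset statistics.
toy_counts = {
    "Eczema": 1000,
    "Squamous Cell Carcinoma": 200,
    "Melanoma": 300,  # hypothetical value
}

weights = balanced_class_weights(toy_counts)
for name, weight in weights.items():
    print(f"{name}: {weight:.2f}")
```

In this toy case the 200-image class ends up with five times the weight of the 1,000-image class. In Keras, a dictionary of this kind (keyed by class index rather than name) is commonly passed as the class_weight argument of model.fit. The post does not show the exact training call used for NOMA AI.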

Data Split and Preprocessing

The dataset was split using TensorFlow’s image_dataset_from_directory:

80% Training — 10,311 images
20% Validation — 2,577 images
Random Seed: 123 (for reproducibility)
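The counts above are consistent with a 20% validation fraction rounded down to a whole number of images, with the remainder going to training. The sketch below reproduces that arithmetic and shows one way the two subsets could be loaded. The helper names split_sizes and load_splits, the batch size of 32, and the lazy TensorFlow import are assumptions for illustration, not taken from the NOMA AI code.

```python
def split_sizes(num_images, validation_split=0.2):
    """Validation gets int(split * n) images; training gets the rest."""
    num_val = int(validation_split * num_images)
    return num_images - num_val, num_val


def load_splits(dataset_path, seed=123):
    """Load training and validation subsets from a folder-per-class directory."""
    # Imported here so split_sizes can be used without TensorFlow installed.
    import tensorflow as tf

    common = dict(
        validation_split=0.2,
        seed=seed,               # the same seed in both calls keeps the subsets disjoint
        image_size=(224, 224),   # images are resized on load
        batch_size=32,           # assumed value; not stated in the post
    )
    train_ds = tf.keras.utils.image_dataset_from_directory(
        dataset_path, subset="training", **common
    )
    val_ds = tf.keras.utils.image_dataset_from_directory(
        dataset_path, subset="validation", **common
    )
    return train_ds, val_ds


# Total taken from the two counts reported in the post.
train_count, val_count = split_sizes(10_311 + 2_577)
print(train_count, val_count)
```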

Image Preprocessing Steps

Each image was:

Resized to 224 × 224 pixels
Normalized to a [0, 1] range
Augmented with:
- Random flipping
- Rotation
- Zoom
- Contrast adjustments
- Brightness variation
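The steps above can be illustrated with a small NumPy stand-in. This is not the TensorFlow pipeline used in the project. It is a sketch that shows resizing, scaling to [0, 1], and three of the listed augmentations (flip, brightness, contrast) on a random fake image. The function names, the nearest-neighbour resize, and the augmentation ranges are all assumptions chosen for illustration.

```python
import numpy as np

TARGET_SIZE = 224


def resize_nearest(image, size=TARGET_SIZE):
    """Nearest-neighbour resize of an (H, W, C) array to (size, size, C)."""
    height, width = image.shape[:2]
    rows = np.arange(size) * height // size
    cols = np.arange(size) * width // size
    return image[rows][:, cols]


def preprocess(image, rng, augment=True):
    """Resize, scale 8-bit pixels to [0, 1], and optionally augment."""
    x = resize_nearest(image).astype(np.float32) / 255.0
    if augment:
        if rng.random() < 0.5:
            x = x[:, ::-1]                      # horizontal flip
        x = x + rng.uniform(-0.1, 0.1)          # brightness shift (assumed range)
        mean = x.mean()
        x = (x - mean) * rng.uniform(0.9, 1.1) + mean  # contrast change (assumed range)
    # Clip so augmented pixels stay inside the normalized range.
    return np.clip(x, 0.0, 1.0).astype(np.float32)


rng = np.random.default_rng(123)
fake_image = rng.integers(0, 256, size=(300, 400, 3), dtype=np.uint8)
out = preprocess(fake_image, rng)
print(out.shape, float(out.min()), float(out.max()))
```

Augmentation is normally applied only to training images, so a validation pipeline would call preprocess with augment=False.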

Why 224 × 224?

224 × 224 is the standard input resolution for MobileNetV3 and its ImageNet-pretrained weights. It offers a practical balance between:

Preserving diagnostic features (e.g., border irregularity, color variation)
Maintaining computational efficiency on edge devices like the Raspberry Pi

Key Takeaway

A balanced dataset is not just about equal numbers—it is about ensuring the model learns to detect rare but critical conditions.

Class weighting gave the underrepresented malignant classes more influence during training, which is intended to make the system more reliable as a screening tool.