Fine-Tuning Gemma 3 with Multimodal (Vision/Image) Inputs

Sep 4, 2025

25 min read

In this tutorial, we'll walk through the complete process of training a specialized vision model on the Datawizz platform that can analyze food images and extract structured information, including dish names, ingredients, nutritional profiles, and portion sizes. By fine-tuning a smaller Gemma 3 4B model on domain-specific data, we'll create a model that not only outperforms larger general-purpose models like GPT-4.1 but also runs faster and more cost-effectively.

Tutorial Outline

  1. Dataset Preparation - Using the MMFood100K dataset from Hugging Face

  2. Prompt Engineering - Creating structured prompts for JSON output

  3. Data Import & Splitting - Importing CSV data and creating train/test splits

  4. Model Training - Fine-tuning Gemma 3 4B with vision capabilities

  5. Custom Evaluators - Building specialized evaluation metrics

  6. Benchmarking - Comparing against GPT-4.1, GPT-4.1 Mini, and the base model

  7. Deployment - Setting up production endpoints

Step 1: Dataset Selection

We'll be using the MMFood100K dataset from Hugging Face, which contains 100,000 food images with detailed metadata, including dish names, ingredients, nutritional information, and portion sizes. For this tutorial, we'll work with a subset of 1,000 samples to keep training times manageable while still demonstrating the full workflow.

Preview of the Hugging Face dataset

The dataset comes in CSV format with image URLs and structured metadata, making it ideal for training vision models to extract specific information from food photographs.
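If you want to pull the same 1,000-sample subset yourself, a minimal sketch with the Hugging Face datasets library looks like this (the repository ID below is a placeholder; substitute the actual MMFood100K repo):

from datasets import load_dataset

# Placeholder repo ID -- replace with the actual MMFood100K dataset on Hugging Face
ds = load_dataset("your-org/MMFood100K", split="train")
subset = ds.shuffle(seed=42).select(range(1000))  # 1,000-sample subset
subset.to_csv("mmfood_1k.csv")                    # CSV file we'll upload in Step 3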

Step 2: Prompt Engineering

The first step in Datawizz is creating a prompt template that instructs the model on exactly what information to extract and in what format. We'll set up a system message that defines the model as a "food analyst" and specifies the exact JSON structure we want returned.

Our prompt will include:

  • A system message with clear instructions

  • A user message template with an image URL variable

  • An assistant message template showing the expected JSON structure

Prompt Template:

<system>
You are an expert food analyst, helping us identify the dish and its key ingredients from a given picture. Given the picture below, please identify the name of the dish (e.g. "Pho") and its ingredients (e.g. ["noodles","beef","basil","lime","green onions","chili"]).

Return your answer in the following JSON format:

{
    "dish_name": "Pho",
    "ingredients": ["noodles","beef","basil","lime","green onions","chili"],
    "nutritional_profile": {"fat_g":25.0,"protein_g":30.0,"calories_kcal":400,"carbohydrate_g":15.0},
    "portion_size": ["noodles:200g","beef:100g","vegetables:50g"]
}

For the nutritional_profile, always provide fat_g, protein_g, calories_kcal and carbohydrate_g, even if they are 0.

For portion size - look at the main components of the meal (no need to provide it for secondary ingredients like sauces etc...). You can group together certain ones (like vegetables for instance). Always provide it in `<name>:<weight>g` format. All measurements in grams, and calories in kcal.
</system>

<user>
Add Image -- {{baseUrl}}
</user>

<assistant>
{
    "dish_name": "{{dish_name}}",
    "ingredients": {{ingredients}},
    "nutritional_profile": {{nutritional_profile}},
    "portion_size": {{portion_size}}
}
</assistant>

This structured approach ensures consistent outputs that can be easily parsed and evaluated programmatically.
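As a quick illustration of that last point, here's a minimal sketch of how a response following this template could be parsed and validated downstream (the helper name and schema checks are our own, not part of the platform):

import json

REQUIRED_KEYS = {"dish_name", "ingredients", "nutritional_profile", "portion_size"}
MACROS = {"fat_g", "protein_g", "calories_kcal", "carbohydrate_g"}

def parse_food_analysis(raw: str) -> dict:
    """Parse a model response and check it matches the schema defined in the prompt."""
    # Strip optional markdown fences before parsing
    cleaned = raw.strip().removeprefix("```json").removesuffix("```")
    data = json.loads(cleaned)
    if REQUIRED_KEYS - data.keys():
        raise ValueError(f"missing keys: {REQUIRED_KEYS - data.keys()}")
    if MACROS - data["nutritional_profile"].keys():
        raise ValueError("incomplete nutritional_profile")
    return data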

Step 3: Data Import & Preparation

With our prompt ready, we'll import the CSV data into Datawizz. The platform automatically materializes the raw CSV data into LLM-compatible input/output pairs using our prompt template.

Key steps in this phase:

  • Upload the CSV file (1,000 samples)

  • Apply the prompt template to transform raw data into training examples

  • Create a test split (10% of data) for unbiased evaluation

This separation ensures fair benchmarking later, as we'll evaluate the model on data it hasn't been trained on.
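Datawizz handles the import and split for you; conceptually, the split is equivalent to a plain pandas sketch like this (file names are ours):

import pandas as pd

df = pd.read_csv("mmfood_1k.csv")              # the 1,000-sample subset from Step 1
test = df.sample(frac=0.10, random_state=42)   # 10% held-out test split (100 rows)
train = df.drop(test.index)                    # remaining 900 rows for fine-tuning
train.to_csv("train.csv", index=False)
test.to_csv("test.csv", index=False)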

Step 4: Model Training

For this task, we'll fine-tune the Gemma 3 4B Instruction-tuned model, which offers:

  • Vision capabilities for processing images

  • Sufficient parameters (4B) for our specialized task

  • Efficient training and inference compared to larger models

Training configuration:

  • Dataset: 900 training samples (after splitting)

  • Epochs: 5 (with early stopping to prevent overfitting)

  • Vision mode: Enabled to process and encode images

  • Default hyperparameters for other settings

The training process automatically provisions GPU resources and monitors for overfitting, stopping early if validation loss begins to increase.
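The exact stopping rule is handled by the platform; conceptually it's a patience-style check on validation loss, along the lines of this illustrative sketch:

def should_stop_early(val_losses: list[float], patience: int = 2) -> bool:
    """Stop once validation loss hasn't improved for `patience` consecutive epochs."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before

# e.g. per-epoch validation losses [0.92, 0.71, 0.65, 0.68, 0.74] -> stop after epoch 5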

Step 5: Creating Custom Evaluators

To properly assess our model's performance, we'll create four custom evaluators, each designed to measure a specific aspect of the model's output:

Dish Name Evaluator (LLM-as-Judge)

You are an expert evaluator, helping us evaluate an AI system for recognizing dishes from images.

You will receive two JSON objects that each contain a dish_name key. I want you to evaluate whether these dish names reference the same dish (they could use different spelling, casing or wording).

## Output: 
{{output}}

## Reference Output: 
{{reference_output}}
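In Datawizz this judge is configured directly in the UI; purely for intuition, here's a rough sketch of what an LLM-as-judge check like this boils down to (the client, judge model, and the MATCH/NO_MATCH convention are placeholders of our own):

from openai import OpenAI

JUDGE_PROMPT = """You are an expert evaluator, helping us evaluate an AI system for recognizing dishes from images.
Do these two JSON objects reference the same dish (ignoring spelling, casing or wording)?

## Output:
{output}

## Reference Output:
{reference_output}

Answer with a single word: MATCH or NO_MATCH."""

def judge_dish_name(client: OpenAI, output: str, reference_output: str) -> bool:
    prompt = JUDGE_PROMPT.format(output=output, reference_output=reference_output)
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",  # hypothetical judge model
        messages=[{"role": "user", "content": prompt}],
    )
    return "NO_MATCH" not in resp.choices[0].message.content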

Portion Size Evaluator (Code-based)

import json
import re

def preprocess_json_string(s):
    s = re.sub(r"^```(json|xml)|```$", "", s).strip()
    return s

def calculate_similarity_score(baseline_value, predicted_value):
    """
    Calculate similarity score between two values.
    - Score of 1.0 if identical
    - Score of 0.0 if error is 100% or more
    - Linear interpolation in between
    """
    if baseline_value == 0:
        # Handle edge case where baseline is 0
        return 1.0 if predicted_value == 0 else 0.0
    
    # Calculate absolute percentage error
    error_percentage = abs(predicted_value - baseline_value) / baseline_value
    
    # Cap at 100% error (score of 0)
    if error_percentage >= 1.0:
        return 0.0
    
    # Linear interpolation: 1.0 at 0% error, 0.0 at 100% error
    return 1.0 - error_percentage

def parse_portion_string(portion_str):
    """
    Parse a portion string like 'noodles:200g' into (ingredient, value, unit)
    Returns (ingredient, numeric_value) tuple
    """
    if not isinstance(portion_str, str):
        return None, None
    
    # Split by colon
    parts = portion_str.split(':')
    if len(parts) != 2:
        return None, None
    
    ingredient = parts[0].strip().lower()
    amount_str = parts[1].strip()
    
    # Extract numeric value from amount string (e.g., "200g" -> 200)
    match = re.match(r'(\d+(?:\.\d+)?)', amount_str)
    if match:
        try:
            value = float(match.group(1))
            return ingredient, value
        except ValueError:
            return ingredient, None
    
    return ingredient, None

def portion_size_compare(baseline_output, new_output):
    baseline_output_string = preprocess_json_string(baseline_output)
    new_output_string = preprocess_json_string(new_output)
    
    # Default result structure
    default_result = {
        "portion_recall": 0.0,
        "value_similarity_score": 0.0,
        "overall_score": 0.0
    }
    
    try:
        baseline_output_json = json.loads(baseline_output_string)
        new_output_json = json.loads(new_output_string)
    except Exception:
        return default_result
    
    if baseline_output_json is None or new_output_json is None:
        return default_result
    
    # Get portion sizes
    baseline_portions = baseline_output_json.get('portion_size', [])
    predicted_portions = new_output_json.get('portion_size', [])
    
    if not isinstance(baseline_portions, list):
        baseline_portions = []
    if not isinstance(predicted_portions, list):
        predicted_portions = []
    
    # Parse baseline portions
    baseline_dict = {}
    for portion in baseline_portions:
        ingredient, value = parse_portion_string(portion)
        if ingredient and value is not None:
            baseline_dict[ingredient] = value
    
    # Parse predicted portions
    predicted_dict = {}
    for portion in predicted_portions:
        ingredient, value = parse_portion_string(portion)
        if ingredient and value is not None:
            predicted_dict[ingredient] = value
    
    if not baseline_dict:
        # No valid baseline portions
        return default_result
    
    # Calculate metrics
    matched_count = 0
    total_similarity_score = 0.0
    portion_details = []  # per-ingredient detail, collected for debugging (not part of the returned scores)
    
    for ingredient, baseline_value in baseline_dict.items():
        if ingredient in predicted_dict:
            matched_count += 1
            predicted_value = predicted_dict[ingredient]
            similarity = calculate_similarity_score(baseline_value, predicted_value)
            total_similarity_score += similarity
            
            portion_details.append({
                "ingredient": ingredient,
                "baseline_value": baseline_value,
                "predicted_value": predicted_value,
                "similarity_score": similarity,
                "matched": True
            })
        else:
            portion_details.append({
                "ingredient": ingredient,
                "baseline_value": baseline_value,
                "predicted_value": None,
                "similarity_score": 0.0,
                "matched": False
            })
    
    # Calculate final metrics
    baseline_count = len(baseline_dict)
    portion_recall = matched_count / baseline_count if baseline_count > 0 else 0.0
    value_similarity_score = total_similarity_score / baseline_count if baseline_count > 0 else 0.0
    
    # Overall score is the product of recall and average similarity
    # (you need both: finding the portions AND getting the values right)
    overall_score = portion_recall * value_similarity_score
    
    return {
        "portion_recall": portion_recall,  # % of baseline portions found in prediction
        "value_similarity_score": value_similarity_score,  # Average similarity of matched values
        "overall_score": overall_score  # Combined metric
    }

def evaluator(inputs: list, outputs: dict, reference_outputs: dict):
    return portion_size_compare(reference_outputs['content'], outputs['content'])

Nutritional Profile Evaluator (Code-based)

import json
import re

def preprocess_json_string(s):
    s = re.sub(r"^```(json|xml)|```$", "", s).strip()
    return s

def calculate_similarity_score(baseline_value, predicted_value):
    """
    Calculate similarity score between two values.
    - Score of 1.0 if identical
    - Score of 0.0 if error is 100% or more
    - Linear interpolation in between
    """
    if baseline_value == 0:
        # Handle edge case where baseline is 0
        return 1.0 if predicted_value == 0 else 0.0
    
    # Calculate absolute percentage error
    error_percentage = abs(predicted_value - baseline_value) / baseline_value
    
    # Cap at 100% error (score of 0)
    if error_percentage >= 1.0:
        return 0.0
    
    # Linear interpolation: 1.0 at 0% error, 0.0 at 100% error
    return 1.0 - error_percentage

def nutritional_profile_compare(baseline_output, new_output):
    baseline_output_string = preprocess_json_string(baseline_output)
    new_output_string = preprocess_json_string(new_output)
    
    # Default result structure
    default_result = {
        "fat_g_score": 0.0,
        "protein_g_score": 0.0,
        "calories_kcal_score": 0.0,
        "carbohydrate_g_score": 0.0,
        "average_score": 0.0,
    }
    
    try:
        baseline_output_json = json.loads(baseline_output_string)
        new_output_json = json.loads(new_output_string)
    except Exception:
        return default_result
    
    if baseline_output_json is None or new_output_json is None:
        return default_result
    
    # Get nutritional profiles
    baseline_nutrition = baseline_output_json.get('nutritional_profile', {})
    predicted_nutrition = new_output_json.get('nutritional_profile', {})
    
    if not isinstance(baseline_nutrition, dict) or not isinstance(predicted_nutrition, dict):
        return default_result
    
    # Define the macros we're tracking
    macros = ['fat_g', 'protein_g', 'calories_kcal', 'carbohydrate_g']
    scores = {}
    
    for macro in macros:
        baseline_value = baseline_nutrition.get(macro, 0)
        predicted_value = predicted_nutrition.get(macro, 0)
        
        # Ensure values are numeric
        try:
            baseline_value = float(baseline_value)
            predicted_value = float(predicted_value)
        except (TypeError, ValueError):
            scores[f"{macro}_score"] = 0.0
            continue
        
        scores[f"{macro}_score"] = calculate_similarity_score(baseline_value, predicted_value)
    
    # Calculate average score
    score_values = [scores.get(f"{macro}_score", 0.0) for macro in macros]
    average_score = sum(score_values) / len(score_values) if score_values else 0.0
    
    return {
        **scores,
        "average_score": average_score
    }


def evaluator(inputs: list, outputs: dict, reference_outputs: dict):
    return nutritional_profile_compare(reference_outputs['content'], outputs['content'])

Ingredients List Evaluator (Code-based)

import json
import re

def preprocess_json_string(s):
    s = re.sub(r"^```(json|xml)|```$", "", s).strip()
    return s

def normalize_ingredient(ingredient):
    """Normalize ingredient for comparison - case insensitive and handle trailing 's'"""
    if not isinstance(ingredient, str):
        return ""
    
    # Convert to lowercase and strip whitespace
    normalized = ingredient.lower().strip()
    
    # Remove trailing 's' if present (simple pluralization)
    if normalized.endswith('s') and len(normalized) > 1:
        # Basic heuristic: keep the 's' for words ending in 'ss', 'us', 'is' or 'os'
        # (e.g. "hummus", "couscous") so we don't mangle them
        if not normalized.endswith(('ss', 'us', 'is', 'os')):
            normalized = normalized[:-1]
    
    return normalized

def ingredients_compare(baseline_output, new_output):
    baseline_output_string = preprocess_json_string(baseline_output)
    new_output_string = preprocess_json_string(new_output)
    
    baseline_output_json = None
    new_output_json = None
    
    try:
        baseline_output_json = json.loads(baseline_output_string)
        new_output_json = json.loads(new_output_string)
    except Exception:
        return {
            "precision": 0,
            "recall": 0,
            "f1": 0,
            "jaccard": 0
        }
    
    if baseline_output_json is None or new_output_json is None:
        return {
            "precision": 0,
            "recall": 0,
            "f1": 0,
            "jaccard": 0
        }
    
    # Get ingredients arrays
    baseline_ingredients = baseline_output_json.get('ingredients', [])
    predicted_ingredients = new_output_json.get('ingredients', [])
    
    # Handle case where ingredients might not be lists
    if not isinstance(baseline_ingredients, list):
        baseline_ingredients = []
    if not isinstance(predicted_ingredients, list):
        predicted_ingredients = []
    
    # Normalize ingredients for comparison
    baseline_set = {normalize_ingredient(ing) for ing in baseline_ingredients if ing}
    predicted_set = {normalize_ingredient(ing) for ing in predicted_ingredients if ing}
    
    # Remove empty strings if any
    baseline_set.discard("")
    predicted_set.discard("")
    
    # Calculate metrics
    if len(predicted_set) == 0 and len(baseline_set) == 0:
        # Both empty - perfect match
        return {
            "precision": 1.0,
            "recall": 1.0,
            "f1": 1.0,
            "jaccard": 1.0
        }
    
    # Calculate overlaps
    true_positives = len(baseline_set & predicted_set)
    false_positives = len(predicted_set - baseline_set)
    false_negatives = len(baseline_set - predicted_set)
    
    # Precision: TP / (TP + FP)
    precision = true_positives / len(predicted_set) if len(predicted_set) > 0 else 0
    
    # Recall: TP / (TP + FN)
    recall = true_positives / len(baseline_set) if len(baseline_set) > 0 else 0
    
    # F1 Score: 2 * (precision * recall) / (precision + recall)
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    
    # Jaccard Index: |A ∩ B| / |A ∪ B|
    union_size = len(baseline_set | predicted_set)
    jaccard = true_positives / union_size if union_size > 0 else 0
    
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "jaccard": jaccard
    }

def evaluator(inputs: list, outputs: dict, reference_outputs: dict):
    return ingredients_compare(reference_outputs['content'], outputs['content'])
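A quick local sanity check of the return shape (the two JSON strings below are made up for illustration):

ref = '{"ingredients": ["noodles", "beef", "basil", "lime"]}'
out = '{"ingredients": ["Noodles", "beef", "cilantro"]}'
print(ingredients_compare(ref, out))
# -> {'precision': 0.67, 'recall': 0.5, 'f1': 0.57, 'jaccard': 0.4} (approximately)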

These evaluators use different approaches:

  • LLM-as-judge for semantic similarity (dish names)

  • Mathematical calculations for numerical accuracy (portions, nutrition)

  • Set-based metrics (precision, recall, F1) for list comparisons (ingredients)

Step 6: Benchmarking & Results

With our evaluators ready, we'll run a comprehensive evaluation comparing:

  • Our fine-tuned food identification model

  • The base Gemma 3 4B model (no fine-tuning)

  • GPT-4.1 Mini

  • GPT-4.1

The evaluation runs all 100 test samples through each model and applies all four evaluators to measure performance across different dimensions. Results show our fine-tuned model consistently outperforming the competition in accuracy while maintaining significantly lower latency and cost.

Evaluation Results

Pricing

On top of the faster response times & higher accuracy, our new 4B-parameter model is also much cheaper to run. For reference, here's a price comparison:

|  | GPT-4.1 | GPT-4.1 Mini | Gemma 3 4B (on Datawizz) |
| --- | --- | --- | --- |
| Cost / 1M Input Tokens | $2.00 | $0.40 | $0.20 |
| Cost / 1M Output Tokens | $8.00 | $1.60 | $0.20 |
| Cost / 1,000 Average Requests (250 prompt tokens, 765 image tokens on OpenAI / 256 on Datawizz, 120 output tokens) | $2.96 | $0.59 | $0.12 |

That works out to roughly 5x cheaper than GPT-4.1 Mini and 25x cheaper than GPT-4.1.
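If you want to sanity-check those per-request figures, the arithmetic is just token counts times the per-million rates; this small helper (ours, not part of the platform) reproduces the table to within a cent of rounding:

def cost_per_1k_requests(in_price, out_price, prompt_toks, image_toks, out_toks):
    """Prices are $ per 1M tokens; returns $ for 1,000 average requests."""
    input_cost = (prompt_toks + image_toks) * 1_000 / 1e6 * in_price
    output_cost = out_toks * 1_000 / 1e6 * out_price
    return input_cost + output_cost

print(cost_per_1k_requests(0.20, 0.20, 250, 256, 120))  # Gemma 3 4B: ~$0.13
print(cost_per_1k_requests(0.40, 1.60, 250, 765, 120))  # GPT-4.1 Mini: ~$0.60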

Step 7: Deployment

Once validated, deploying the model is straightforward:

  1. Deploy to Datawizz Serverless (or other providers like AWS, GCP, Azure)

  2. Create an endpoint for application integration

  3. Connect the prompt template for automatic formatting

  4. Use standard OpenAI SDK with updated base URL and API key

The deployed model can be called directly from applications, maintaining the same interface as popular LLM APIs while delivering specialized, high-performance results.
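For example, a call against the deployed endpoint with the official OpenAI Python SDK might look like this (base URL, API key, and model ID below are placeholders; use the values from your Datawizz endpoint):

from openai import OpenAI

client = OpenAI(
    base_url="https://YOUR-DATAWIZZ-ENDPOINT/v1",  # placeholder base URL
    api_key="YOUR_DATAWIZZ_API_KEY",               # placeholder API key
)

response = client.chat.completions.create(
    model="your-finetuned-gemma-3-4b",             # placeholder model/endpoint ID
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Identify the dish and its ingredients."},
            {"type": "image_url", "image_url": {"url": "https://example.com/food.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)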

Summary

This tutorial demonstrated how Datawizz enables you to create specialized vision models that outperform general-purpose LLMs for domain-specific tasks. By fine-tuning a smaller model on targeted data, we achieved:

  • Higher accuracy than GPT-4.1 across all evaluation metrics

  • 50% faster inference compared to GPT-4.1 Full

  • Significantly lower costs due to smaller model size

  • Structured, reliable outputs perfect for production applications

The combination of easy data import, automated training, custom evaluation, and seamless deployment makes Datawizz an ideal platform for building specialized AI models that deliver better results at lower costs than generic alternatives. Whether you're working with food recognition, document processing, or any other vision-based task, this workflow provides a template for creating production-ready specialized models.
