Fine-Tuning Gemma 3 with Multimodal (Vision/Image) Inputs
Sep 4, 2025
25 min read
In this tutorial, we'll walk through the complete process of training a specialized vision model on the Datawizz platform that can analyze food images and extract structured information including dish names, ingredients, nutritional profiles, and portion sizes. By fine-tuning the smaller Gemma 3 4B model on domain-specific data, we'll create a model that not only outperforms larger generic models like GPT-4.1 but also runs faster and more cost-effectively.
Tutorial Outline
Dataset Preparation - Using the MM-Food-100K dataset from Hugging Face
Prompt Engineering - Creating structured prompts for JSON output
Data Import & Splitting - Importing CSV data and creating train/test splits
Model Training - Fine-tuning Gemma 3 4B with vision capabilities
Custom Evaluators - Building specialized evaluation metrics
Benchmarking - Comparing against GPT-4.1 and the base model
Deployment - Setting up production endpoints
Step 1: Dataset Selection
We'll be using the MM-Food-100K dataset from Hugging Face, which contains 100,000 food images with detailed metadata including dish names, ingredients, nutritional information, and portion sizes. For this tutorial, we'll work with a subset of 1,000 samples to keep training times manageable while still demonstrating the full workflow.
The dataset comes in CSV format with image URLs and structured metadata, making it ideal for training vision models to extract specific information from food photographs.
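If you'd like to inspect the subset locally before uploading, a minimal sketch with the Hugging Face datasets library looks like this. The dataset path below is an assumption; check the dataset card on Hugging Face for the exact identifier and column names.

```python
# A minimal sketch: download the dataset, take a 1,000-row sample, save as CSV.
# "Codatta/MM-Food-100K" is a hypothetical path -- verify it on Hugging Face.
from datasets import load_dataset

ds = load_dataset("Codatta/MM-Food-100K", split="train")
subset = ds.shuffle(seed=42).select(range(1000))  # keep training times manageable
subset.to_csv("mm_food_1k.csv")
```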
Step 2: Prompt Engineering
The first step in Datawizz is creating a prompt template that instructs the model on exactly what information to extract and in what format. We'll set up a system message that defines the model as a "food analyst" and specifies the exact JSON structure we want returned.
Our prompt will include:
A system message with clear instructions
A user message template with an image URL variable
An assistant message template showing the expected JSON structure
Prompt Template:
<system>
You are an expert food analyst, helping us identify the dish and its key ingredients from a given picture. Given the picture below, please identify the name of the dish (e.g. "Pho") and its ingredients (e.g. ["noodles","beef","basil","lime","green onions","chili"]).
Return your answer in the following JSON format:
{"dish_name": "Pho",
"ingredients": ["noodles","beef","basil","lime","green onions","chili"],
"nutritional_profile": {"fat_g":25.0,"protein_g":30.0,"calories_kcal":400,"carbohydrate_g":15.0},
"portion_size": ["noodles:200g","beef:100g","vegetables:50g"]
}
For the nutritional_profile, always provide fat_g, protein_g, calories_kcal and carbohydrate_g, even if they are 0.
For portion size - look at the main components of the meal (no need to provide it for secondary ingredients like sauces, etc.). You can group certain ones together (like vegetables, for instance). Always provide it in `<name>:<weight>g` format. All measurements in grams, and calories in kcal.
</system>

<user>
[Image: {{baseUrl}}]
</user>

<assistant>
{"dish_name": "{{dish_name}}",
"ingredients": {{ingredients}},
"nutritional_profile": {{nutritional_profile}},
"portion_size": {{portion_size}}
}
</assistant>
This structured approach ensures consistent outputs that can be easily parsed and evaluated programmatically.
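For instance, a downstream consumer can parse and validate a response in a couple of lines (the sample string below is made up):

```python
import json

REQUIRED_KEYS = {"dish_name", "ingredients", "nutritional_profile", "portion_size"}

# A made-up example of a model response in the format our prompt requests
raw = (
    '{"dish_name": "Pho", "ingredients": ["noodles", "beef"], '
    '"nutritional_profile": {"fat_g": 25.0, "protein_g": 30.0, '
    '"calories_kcal": 400, "carbohydrate_g": 15.0}, '
    '"portion_size": ["noodles:200g", "beef:100g"]}'
)

parsed = json.loads(raw)
assert REQUIRED_KEYS <= parsed.keys(), "model output is missing required fields"
```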
Step 3: Data Import & Preparation
With our prompt ready, we'll import the CSV data into Datawizz. The platform automatically materializes the raw CSV data into LLM-compatible input/output pairs using our prompt template.
Key steps in this phase:
Upload the CSV file (1,000 samples)
Apply the prompt template to transform raw data into training examples
Create a test split (10% of data) for unbiased evaluation
This separation ensures fair benchmarking later, as we'll evaluate the model on data it hasn't been trained on.
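Datawizz handles the split in the UI, but for intuition, the equivalent operation in pandas is roughly the following (file names here are assumptions):

```python
import pandas as pd

df = pd.read_csv("mm_food_1k.csv")
test = df.sample(frac=0.10, random_state=42)  # 100 held-out samples
train = df.drop(test.index)                   # 900 training samples
train.to_csv("train.csv", index=False)
test.to_csv("test.csv", index=False)
```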
Step 4: Model Training
For this task, we'll fine-tune the Gemma 3 4B Instruction-tuned model, which offers:
Vision capabilities for processing images
Sufficient parameters (4B) for our specialized task
Efficient training and inference compared to larger models
Training configuration:
Dataset: 900 training samples (after splitting)
Epochs: 5 (with early stopping to prevent overfitting)
Vision mode: Enabled to process and encode images
Default hyperparameters for other settings
The training process automatically provisions GPU resources and monitors for overfitting, stopping early if validation loss begins to increase.
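For reference, the same settings written out as a plain dict. This is a sketch only; the key names below are ours, not a Datawizz API, since these options are set in the training UI:

```python
# Illustrative only: the key names are ours, not a Datawizz API.
training_config = {
    "base_model": "gemma-3-4b-it",  # Gemma 3 4B, instruction-tuned, vision-capable
    "train_samples": 900,           # 90% of the 1,000-sample subset
    "epochs": 5,
    "early_stopping": True,         # halt if validation loss starts increasing
    "vision_mode": True,            # encode images alongside text
}
```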
Step 5: Creating Custom Evaluators
To properly assess our model's performance, we'll create four custom evaluators, each designed to measure a specific aspect of the model's output:
Dish Name Evaluator (LLM-as-Judge)
You are an expert evaluator, helping us evaluate an AI system for recognizing dishes from images.
You will receive two JSON objects that each contain a dish_name key. I want you to evaluate whether these dish names reference the same dish (allowing for different spelling, casing, or wording).
## Output: {{output}}
## Reference Output: {{reference_output}}
Portion Size Evaluator (Code-based)
```python
import json
import re


def preprocess_json_string(s):
    # Strip markdown code fences (```json ... ```) the model may wrap around its output
    s = re.sub(r"^```(json|xml)|```$", "", s).strip()
    return s


def calculate_similarity_score(baseline_value, predicted_value):
    """
    Calculate similarity score between two values.
    - Score of 1.0 if identical
    - Score of 0.0 if error is 100% or more
    - Linear interpolation in between
    """
    if baseline_value == 0:
        # Handle edge case where baseline is 0
        return 1.0 if predicted_value == 0 else 0.0
    # Calculate absolute percentage error
    error_percentage = abs(predicted_value - baseline_value) / baseline_value
    # Cap at 100% error (score of 0)
    if error_percentage >= 1.0:
        return 0.0
    # Linear interpolation: 1.0 at 0% error, 0.0 at 100% error
    return 1.0 - error_percentage


def parse_portion_string(portion_str):
    """
    Parse a portion string like 'noodles:200g' into an (ingredient, numeric_value) tuple.
    """
    if not isinstance(portion_str, str):
        return None, None
    # Split by colon
    parts = portion_str.split(':')
    if len(parts) != 2:
        return None, None
    ingredient = parts[0].strip().lower()
    amount_str = parts[1].strip()
    # Extract numeric value from amount string (e.g., "200g" -> 200)
    match = re.match(r'(\d+(?:\.\d+)?)', amount_str)
    if match:
        try:
            return ingredient, float(match.group(1))
        except ValueError:
            return ingredient, None
    return ingredient, None


def portion_size_compare(baseline_output, new_output):
    baseline_output_string = preprocess_json_string(baseline_output)
    new_output_string = preprocess_json_string(new_output)

    # Default result structure
    default_result = {"portion_recall": 0.0, "value_similarity_score": 0.0, "overall_score": 0.0}

    try:
        baseline_output_json = json.loads(baseline_output_string)
        new_output_json = json.loads(new_output_string)
    except Exception:
        return default_result
    if baseline_output_json is None or new_output_json is None:
        return default_result

    # Get portion sizes
    baseline_portions = baseline_output_json.get('portion_size', [])
    predicted_portions = new_output_json.get('portion_size', [])
    if not isinstance(baseline_portions, list):
        baseline_portions = []
    if not isinstance(predicted_portions, list):
        predicted_portions = []

    # Parse baseline portions
    baseline_dict = {}
    for portion in baseline_portions:
        ingredient, value = parse_portion_string(portion)
        if ingredient and value is not None:
            baseline_dict[ingredient] = value

    # Parse predicted portions
    predicted_dict = {}
    for portion in predicted_portions:
        ingredient, value = parse_portion_string(portion)
        if ingredient and value is not None:
            predicted_dict[ingredient] = value

    if not baseline_dict:
        # No valid baseline portions
        return default_result

    # Calculate metrics (portion_details is kept for debugging; it isn't returned)
    matched_count = 0
    total_similarity_score = 0.0
    portion_details = []
    for ingredient, baseline_value in baseline_dict.items():
        if ingredient in predicted_dict:
            matched_count += 1
            predicted_value = predicted_dict[ingredient]
            similarity = calculate_similarity_score(baseline_value, predicted_value)
            total_similarity_score += similarity
            portion_details.append({"ingredient": ingredient, "baseline_value": baseline_value,
                                    "predicted_value": predicted_value,
                                    "similarity_score": similarity, "matched": True})
        else:
            portion_details.append({"ingredient": ingredient, "baseline_value": baseline_value,
                                    "predicted_value": None,
                                    "similarity_score": 0.0, "matched": False})

    # Calculate final metrics
    baseline_count = len(baseline_dict)
    portion_recall = matched_count / baseline_count if baseline_count > 0 else 0.0
    value_similarity_score = total_similarity_score / baseline_count if baseline_count > 0 else 0.0

    # Overall score is the product of recall and average similarity
    # (you need both: finding the portions AND getting the values right)
    overall_score = portion_recall * value_similarity_score

    return {
        "portion_recall": portion_recall,                  # % of baseline portions found in prediction
        "value_similarity_score": value_similarity_score,  # average similarity of matched values
        "overall_score": overall_score,                    # combined metric
    }


def evaluator(inputs: list, outputs: dict, reference_outputs: dict):
    return portion_size_compare(reference_outputs['content'], outputs['content'])
```
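Before wiring this into the platform, it's worth sanity-checking it locally on hand-written JSON strings (the values below are made up):

```python
ref = '{"portion_size": ["noodles:200g", "beef:100g"]}'
pred = '{"portion_size": ["noodles:180g", "beef:100g"]}'
print(portion_size_compare(ref, pred))
# -> {'portion_recall': 1.0, 'value_similarity_score': 0.95, 'overall_score': 0.95}
```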
Nutritional Profile Evaluator (Code-based)
```python
import json
import re


def preprocess_json_string(s):
    # Strip markdown code fences (```json ... ```) the model may wrap around its output
    s = re.sub(r"^```(json|xml)|```$", "", s).strip()
    return s


def calculate_similarity_score(baseline_value, predicted_value):
    """
    Calculate similarity score between two values.
    - Score of 1.0 if identical
    - Score of 0.0 if error is 100% or more
    - Linear interpolation in between
    """
    if baseline_value == 0:
        # Handle edge case where baseline is 0
        return 1.0 if predicted_value == 0 else 0.0
    # Calculate absolute percentage error
    error_percentage = abs(predicted_value - baseline_value) / baseline_value
    # Cap at 100% error (score of 0)
    if error_percentage >= 1.0:
        return 0.0
    # Linear interpolation: 1.0 at 0% error, 0.0 at 100% error
    return 1.0 - error_percentage


def nutritional_profile_compare(baseline_output, new_output):
    baseline_output_string = preprocess_json_string(baseline_output)
    new_output_string = preprocess_json_string(new_output)

    # Default result structure
    default_result = {
        "fat_g_score": 0.0,
        "protein_g_score": 0.0,
        "calories_kcal_score": 0.0,
        "carbohydrate_g_score": 0.0,
        "average_score": 0.0,
    }

    try:
        baseline_output_json = json.loads(baseline_output_string)
        new_output_json = json.loads(new_output_string)
    except Exception:
        return default_result
    if baseline_output_json is None or new_output_json is None:
        return default_result

    # Get nutritional profiles
    baseline_nutrition = baseline_output_json.get('nutritional_profile', {})
    predicted_nutrition = new_output_json.get('nutritional_profile', {})
    if not isinstance(baseline_nutrition, dict) or not isinstance(predicted_nutrition, dict):
        return default_result

    # Define the macros we're tracking
    macros = ['fat_g', 'protein_g', 'calories_kcal', 'carbohydrate_g']
    scores = {}
    for macro in macros:
        baseline_value = baseline_nutrition.get(macro, 0)
        predicted_value = predicted_nutrition.get(macro, 0)
        # Ensure values are numeric
        try:
            baseline_value = float(baseline_value)
            predicted_value = float(predicted_value)
        except (TypeError, ValueError):
            scores[f"{macro}_score"] = 0.0
            continue
        scores[f"{macro}_score"] = calculate_similarity_score(baseline_value, predicted_value)

    # Calculate average score
    score_values = [scores.get(f"{macro}_score", 0.0) for macro in macros]
    average_score = sum(score_values) / len(score_values) if score_values else 0.0

    return {**scores, "average_score": average_score}


def evaluator(inputs: list, outputs: dict, reference_outputs: dict):
    return nutritional_profile_compare(reference_outputs['content'], outputs['content'])
```
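The same local sanity check works here (values made up):

```python
ref = ('{"nutritional_profile": {"fat_g": 20.0, "protein_g": 30.0, '
       '"calories_kcal": 400, "carbohydrate_g": 50.0}}')
pred = ('{"nutritional_profile": {"fat_g": 10.0, "protein_g": 30.0, '
        '"calories_kcal": 500, "carbohydrate_g": 50.0}}')
print(nutritional_profile_compare(ref, pred))
# -> {'fat_g_score': 0.5, 'protein_g_score': 1.0, 'calories_kcal_score': 0.75,
#     'carbohydrate_g_score': 1.0, 'average_score': 0.8125}
```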
Together, these evaluators combine three techniques:
LLM-as-judge comparisons for fuzzy text matching (dish names)
Mathematical calculations for numerical accuracy (portions, nutrition)
Set-based metrics (precision, recall, F1) for list comparisons (ingredients); a sketch of this fourth evaluator follows below
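Ingredient Evaluator (Code-based)
The exact platform code for this fourth evaluator isn't reproduced in this post. As a rough sketch only, a set-based version following the same conventions as the evaluators above could look like this (the function and metric names are ours, not the exact platform code):

```python
import json
import re


def preprocess_json_string(s):
    # Same fence-stripping as the evaluators above
    return re.sub(r"^```(json|xml)|```$", "", s).strip()


def ingredients_compare(baseline_output, new_output):
    default = {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    try:
        baseline = {i.strip().lower() for i in
                    json.loads(preprocess_json_string(baseline_output)).get("ingredients", [])}
        predicted = {i.strip().lower() for i in
                     json.loads(preprocess_json_string(new_output)).get("ingredients", [])}
    except Exception:
        return default
    if not baseline or not predicted:
        return default
    matched = len(baseline & predicted)
    precision = matched / len(predicted)  # how much of the prediction is correct
    recall = matched / len(baseline)      # how much of the reference was found
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def evaluator(inputs: list, outputs: dict, reference_outputs: dict):
    return ingredients_compare(reference_outputs['content'], outputs['content'])
```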
Step 6: Benchmarking & Results
With our evaluators ready, we'll run a comprehensive evaluation comparing:
Our fine-tuned food identification model
The base Gemma 3 4B model (untrained)
GPT-4.1 Mini
GPT-4.1
The evaluation runs all 100 test samples through each model and applies all four evaluators to measure performance across different dimensions. Results show our fine-tuned model consistently outperforming the competition in accuracy while maintaining significantly lower latency and cost.
Pricing
On top of the faster response times and higher accuracy, our new 4B parameter model is also much cheaper to run. For reference, here's a price comparison:
|  | GPT-4.1 | GPT-4.1 Mini | Gemma 3 4B (on Datawizz) |
| --- | --- | --- | --- |
| Cost / 1M input tokens | $2.00 | $0.40 | $0.20 |
| Cost / 1M output tokens | $8.00 | $1.60 | $0.20 |
| Cost / 1,000 average requests* | $2.96 | $0.59 | $0.12 |

*Per request: 250 prompt input tokens; 765 image tokens on OpenAI vs. 256 image tokens on Datawizz; 120 output tokens.

At these rates, the fine-tuned model is roughly 5x cheaper than GPT-4.1 Mini and 25x cheaper than GPT-4.1.
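As a sanity check, the last row follows directly from the per-token prices (small differences from the table come from rounding in the underlying token counts):

```python
def cost_per_1k_requests(input_price, output_price, prompt_toks, image_toks, output_toks):
    """Prices in $ per 1M tokens; token counts are per request; 1,000 requests total."""
    input_cost = (prompt_toks + image_toks) * 1_000 / 1_000_000 * input_price
    output_cost = output_toks * 1_000 / 1_000_000 * output_price
    return input_cost + output_cost

print(cost_per_1k_requests(2.00, 8.00, 250, 765, 120))  # GPT-4.1      -> ~$2.99
print(cost_per_1k_requests(0.40, 1.60, 250, 765, 120))  # GPT-4.1 Mini -> ~$0.60
print(cost_per_1k_requests(0.20, 0.20, 250, 256, 120))  # Gemma 3 4B   -> ~$0.13
```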
Step 7: Deployment
Once validated, deploying the model is straightforward:
Deploy to Datawizz Serverless (or other providers like AWS, GCP, Azure)
Create an endpoint for application integration
Connect the prompt template for automatic formatting
Use the standard OpenAI SDK with an updated base URL and API key (see the sketch below)
The deployed model can be called directly from applications, maintaining the same interface as popular LLM APIs while delivering specialized, high-performance results.
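Here's a minimal sketch of that call, assuming a chat-completions-compatible endpoint. The base URL, model ID, and environment variable below are placeholders to replace with the values from your Datawizz endpoint page:

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.datawizz.example/v1",  # placeholder: your endpoint's base URL
    api_key=os.environ["DATAWIZZ_API_KEY"],      # placeholder: your endpoint API key
)

# With the prompt template connected, the system message is applied
# server-side, so the request only needs to supply the image.
response = client.chat.completions.create(
    model="food-identification-v1",  # placeholder: your deployed model/endpoint ID
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/pho.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)  # structured JSON: dish, ingredients, nutrition, portions
```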
Summary
This tutorial demonstrated how Datawizz enables you to create specialized vision models that outperform general-purpose LLMs for domain-specific tasks. By fine-tuning a smaller model on targeted data, we achieved:
Higher accuracy than GPT-4.1 across all evaluation metrics
50% faster inference compared to full GPT-4.1
Significantly lower costs due to smaller model size
Structured, reliable outputs perfect for production applications
The combination of easy data import, automated training, custom evaluation, and seamless deployment makes Datawizz an ideal platform for building specialized AI models that deliver better results at lower costs than generic alternatives. Whether you're working with food recognition, document processing, or any other vision-based task, this workflow provides a template for creating production-ready specialized models.