Understanding Training Data and Models
Beginner Course · Week 2

| Lesson Goal | Where does AI live? How do you choose the right model and keep your data secure or even private? |
|---|---|
| What you’ll learn | By the end of this week you will be able to: - Explain the difference between cloud-hosted AI and locally-running AI, including the privacy trade-offs of each - Define data sovereignty and give a real-world example of why it matters to you personally - Compare open-source and closed models and explain what a guardrail is and why it exists - Install and run an open-source AI model on your own machine without a cloud account or internet connection - Train a small custom image classifier on data you collected yourself with the "My Room" web app |
| Tools you’ll need | A computer with a web browser, ability to install software (Ollama), and your AI Audit Card from Week 1 |
| End result | A categorized room in the My Room app with locally-stored artifacts, Ollama installed and running locally, and the ability to choose between cloud and local AI intentionally |
| Time needed to complete | 90 minutes of guided work + 30 minutes of independent practice |
Session Plan
Part 1 — Cloud vs. Local: Where Does Your AI Live?
Warm-up: Where Did Your Last AI Query Go? (5 min)
Think about the last time you used ChatGPT, Claude, or any AI tool. Where do you think your question went? Take 30 seconds to draw a simple diagram showing what you imagine happens after you hit "enter."
Share your drawings with a partner — what's similar? What's different?
Mini-lecture: What Does "The Cloud" Actually Mean? (10 min)
You've heard "the cloud" a thousand times. But what is it, really?
The cloud is just someone else's computer.
When you use a cloud-based AI tool like ChatGPT or Claude, here's what actually happens:
- You type a prompt and hit enter
- That text is split into packets and sent across the internet
- It arrives at a data center — a warehouse full of servers, possibly in another country
- The company's model processes your input on their hardware
- The response is sent back to you
This means your data leaves your device. It travels over networks you don't control, to servers you don't own, governed by laws that may not be your own.
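To make the steps above concrete, here is a minimal Python sketch of what a cloud AI client might package up before sending. The endpoint `https://api.example.com/v1/chat` and the `prompt` field are made up for illustration, and real services use their own formats. The request is built but never sent, so you can safely inspect exactly what would leave your device.

```python
import json
import urllib.request

# Hypothetical endpoint, used only for illustration; not a real service.
CLOUD_URL = "https://api.example.com/v1/chat"


def build_cloud_request(prompt: str) -> urllib.request.Request:
    """Package a prompt the way a cloud AI client might, without sending it."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        CLOUD_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_cloud_request("Help me write about my health condition")

# Everything printed here would leave your device if the request were sent.
print("Destination:", request.full_url)
print("Payload:", request.data.decode("utf-8"))
```

Notice that the full text of the prompt sits inside the payload. Whatever you type is what travels to the server.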
How most AI today runs on cloud servers
Almost every AI tool you've used (ChatGPT, Claude, Gemini, Grammarly, Snapchat filters, TikTok's recommendation algorithm) runs on cloud servers. You send data out, the server processes it, and the result comes back. This is convenient because:
- The company's servers are powerful (you don't need an expensive computer of your own)
- The model is always up to date
- You don't need to install anything
But there's a trade-off.
The privacy trade-off: convenience vs. data leaving your machine
Every time you use a cloud AI tool, you're trading a piece of your privacy for convenience. The question is: are you making that trade intentionally, or by default?
| Convenience | Privacy Cost |
|---|---|
| No installation needed | Your data travels over the internet |
| Runs on powerful servers | The company may store your conversations |
| Always up to date | Your data may be used for training |
| Works on any device | Subject to the laws of the server's country |
What is data sovereignty?
Data sovereignty is the idea that data is subject to the laws of the country where it's stored. If you're in the US and your data is processed on a server in Ireland, Irish and EU law (like GDPR) applies to it, not just US law.
This matters when you wonder:
- Who controls your data? Can you request that it be deleted? Can you see what data is kept?
- Where is it stored? Is it in your country? Another country? Multiple countries?
- What laws govern it? Does the local government have the right to access it? What about other people? What privacy protections apply?
Real-world examples
- School policies banning ChatGPT uploads: Many schools now prohibit uploading student work to AI tools because of privacy concerns: your homework could become training data, and your personal information could be exposed.
- GDPR (General Data Protection Regulation): A European law that gives people the right to know what data companies hold on them, and the right to have it deleted. If a company processes the personal data of people in the EU, GDPR applies, even if the company is based elsewhere.
- Data leaks: In 2023, Samsung employees accidentally leaked confidential data by pasting it into ChatGPT. The company banned the tool internally. Your private data can become part of a model's training set without your knowledge.
Discussion prompt (10 min): "When is cloud AI OK? When would you insist on local?"
Split into small groups. For each scenario, decide: cloud or local? Why?
- You're writing a poem for fun and want help with rhyming
- You're working on a school project about a personal health topic
- You're a doctor analyzing patient records
- You're building an app that will handle other people's data
- You're studying for a test and need a study guide
Share your group's answers with the class. Where did people draw the line differently?
Now that you've thought deeply about where data lives, let's turn our attention to how data is used to train various types of AI and ML models.
Part 2 — Open and Closed: Model Choice (20 min)
Warm-up: What Does "Open" Mean to You? (3 min)
When you hear "open source," what comes to mind? Write down one word or phrase. Now imagine an "open" AI model. What would that look like compared to a "closed" one?
Open source means that software released under an open source license is "distributed under terms that guarantee users the freedom to freely share, modify, and use the software for any purpose. It strictly requires providing accessible, non-obfuscated source code and forbids discrimination against any person, group, or field of endeavor." (source)
Mini-lecture: Open vs. Closed Models (7 min)
Some LLMs are open source because they check the following boxes:
| | (Mostly) Open Model | Closed Model |
|---|---|---|
| Examples | Llama 3 (Meta)*, Mistral, Phi (Microsoft), Gemma4 (Google), DeepSeek | GPT-4 (OpenAI), Claude (Anthropic), Gemini (Google) |
| Can you see the code? | Usually yes — weights and architecture are published | No — the model is accessed via API only |
| Can you run it yourself? | Yes — download and run locally | No — only accessible through the company's servers |
| Can you audit the training data? | Sometimes — varies by model** | Almost never — considered proprietary |
| Who can modify it? | Anyone with the skills to fine-tune it | Only the company |
| Guardrails | Often minimal or removable | Built-in by the company |
*Llama is often called open source, but it does not fit the open source definition quoted above (source). Licensing is tricky and usually the domain of legal departments, but in general, look for classic open source licenses when using downloadable models. Apache (like Gemma4's) and MIT licenses are good signals.
**Getting a view of training data is one of the trickiest parts of determining whether a model is really open source or not.
Why does open matter?
Open models allow:
- Auditing: Researchers can inspect the model for bias, safety issues, or data leaks
- Customization: You can fine-tune the model on your own data
- Offline use: No internet required, no data leaves your machine
- Longevity: If a company shuts down, you still have the model
What are guardrails? (5 min)
Guardrails are safety mechanisms built into or around AI models. They can:
- Block harmful or offensive outputs
- Refuse to generate content on certain topics
- Detect prompt injection attacks
- Filter personally identifiable information
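Two of the behaviors in the list above can be sketched as a toy guardrail in Python. This is only an illustration: the blocked phrases, the email pattern, and the function names are all invented here, and production guardrails rely on far more sophisticated methods than keyword matching.

```python
import re

# Hypothetical blocklist for illustration; real systems use far richer policies.
BLOCKED_TOPICS = ("pick a lock", "make a weapon")

# A simple pattern for things that look like email addresses.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def check_input(prompt: str) -> bool:
    """Return True if the prompt is allowed through to the model."""
    lowered = prompt.lower()
    return not any(topic in lowered for topic in BLOCKED_TOPICS)


def filter_output(text: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_PATTERN.sub("[email removed]", text)


def guarded_reply(prompt: str, model_reply: str) -> str:
    """Wrap a model reply with an input check and an output filter."""
    if not check_input(prompt):
        return "Sorry, I can't help with that topic."
    return filter_output(model_reply)


print(guarded_reply("Explain how to pick a lock.", "Step one..."))
print(guarded_reply("Who do I contact?", "Write to ada@example.com today."))
```

Someone had to choose what goes in `BLOCKED_TOPICS`. That choice is the part worth questioning, and it is easy to see how a blunt rule could block harmless questions too.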
Activity: Guardrail Detective (5 min)
Think of a topic where a chatbot might refuse to answer. Try these prompts in any AI tool you have access to:
Explain how to pick a lock.
Draw a picture of a teddy bear shooting pool.
Notice what happens. Does the model refuse? Does it explain the theory but not give instructions? Does it comply fully? Does it avoid the topic entirely?
Now consider: Who decided that this topic should be guarded? What if the guardrail were set by a government, a corporation, or a community? How might those differ?
Discussion prompt: "Is a model with guardrails truly open? Should open models have guardrails at all — and who should decide what they block?"
Part 3 — "My Room": Train a Vision Model on Your Own Data (25 min)
Introduction: What is MobileNet? (5 min)
Let's move from theory to practice. We're going to work with an app built for Her AI Studio called "My Room". Right off, you can see how an app that uses your camera might create a privacy concern. So let's look at how this app handles that.
The My Room app uses MobileNet, a lightweight image classification model designed to run on phones and browsers, not data centers. It was created by Google researchers to bring AI to devices with limited computing power.
Fun fact: MobileNet was designed to perform well on ImageNet's benchmark while being efficient enough for mobile devices. ImageNet is the dataset created by Fei-Fei Li, who we learned about last week. MobileNet's pre-trained "base knowledge" (the ability to recognize edges, shapes, and objects) came from training on her dataset.
What makes MobileNet special:
- It's small: a fraction of the size of models like GPT-4
- It's fast: can classify images in milliseconds
- It's local: designed to run on your device, not in the cloud
Today, you're not just using MobileNet. You're going to train it on your own data.
A note on "local" vs. "cloud"
The My Room app is served to you from GitHub Pages. It's a web app, so its code lives on a server. But here's the key distinction: the app comes from the cloud, but your data never leaves your device.
Your images, labels, and the model you train all stay in your browser. They never get uploaded to a server, stored in a database, or used to train someone else's AI. The model runs on your machine, using your computer's own hardware.
This makes My Room a hybrid — a useful middle ground between fully cloud-based AI (like ChatGPT) and fully local AI (like Ollama, which we'll try next):
| | Cloud AI (ChatGPT) | My Room App (Hybrid) | Fully Local (Ollama) |
|---|---|---|---|
| Code hosted on | Company servers | GitHub Pages (cloud) | Downloaded to your machine |
| Data processed on | Company servers | Your browser (local) | Your machine (local) |
| Data leaves your device? | Yes | No | No |
| Internet needed? | Yes | Yes (to load the app) | No (after download) |
Activity: Set Up My Room (5 min)
- Open the My Room web app
- Look around the interface — you'll see a camera/viewfinder, a label input, and a "train" button
- Notice: there's no login, no account, no data upload. Everything stays in your browser.
Fun fact: The My Room app uses TensorFlow.js to run MobileNet entirely in your browser. Your data never touches a server — it's processed using your computer's own hardware (CPU or GPU).
Activity: Collect and Label Your Data (10 min)
- Choose three objects in this room that look different from each other (e.g., a book, a water bottle, a plant). In a classroom context, try to classify pens, papers, books, computer equipment, or items in your bag.
- For each object:
  - Hold the object up to the computer's camera
  - Type a label (e.g., "water_bottle")
  - Capture 5-10 images from different angles. You can do this by taking one picture at a time or by using the video
- Repeat for all three objects
Why multiple angles? Models need variety to generalize. If you only photograph your water bottle from the front, the model might think "water bottle" means "round shape with a label" and misclassify a soda can.
Activity: Train and Test (5 min)
- Click "Train". The model will now learn to distinguish between your groups of objects
- Once training completes, hold up each object to the camera
- Does the model recognize them? What does the confidence score look like (e.g., 95% sure it's a water bottle)?
- Try showing the model something it hasn't seen: another object, or your hand. What does it predict?
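Why does the model still name one of your objects when you show it your hand? One common way classifiers produce a confidence score is a function called softmax, which spreads 100% across only the labels the model knows. The Python sketch below assumes that approach (the My Room app may compute its score differently), and the raw scores are invented for illustration.

```python
import math


def softmax(scores: dict[str, float]) -> dict[str, float]:
    """Turn raw scores into confidences that always sum to 1."""
    biggest = max(scores.values())
    # Subtracting the largest score keeps the exponentials numerically stable.
    exps = {label: math.exp(s - biggest) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


# Made-up raw scores for a clear photo of a water bottle.
known_object = softmax({"book": 0.5, "water_bottle": 4.0, "plant": 0.2})

# Made-up raw scores for a hand, which the model was never trained on.
unseen_object = softmax({"book": 1.1, "water_bottle": 1.3, "plant": 0.9})

for name, result in [("known", known_object), ("unseen", unseen_object)]:
    best = max(result, key=result.get)
    print(f"{name}: {best} ({result[best]:.0%})")
```

Because the confidences must add up to 100% across the known labels, there is no built-in way to answer "none of the above". An unfamiliar object still gets one of your three labels, usually with a lower score.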
Reflection: What Just Happened?
You just completed the full ML pipeline:
- Data collection: you gathered and labeled images
- Training: the model learned patterns for each label
- Inference: it classified new images it had never seen
This is the same process that powers everything from photo tagging to medical imaging — except you did it with your own data, without sending anything to a cloud server.
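The three steps can be sketched in a few lines of Python. This is a simplified stand-in, not the app's actual code: it assumes each image has already been turned into a short list of numbers (the job MobileNet does in the real app), and it uses a "nearest average" rule chosen here for clarity. The example vectors are invented.

```python
import math

# Stand-in for MobileNet: in the real app a network turns each image into a
# list of numbers (an "embedding"). These short vectors are invented.
Embedding = list[float]


def train(examples: dict[str, list[Embedding]]) -> dict[str, Embedding]:
    """Training step: average the embeddings collected for each label."""
    centroids = {}
    for label, vectors in examples.items():
        size = len(vectors[0])
        centroids[label] = [
            sum(v[i] for v in vectors) / len(vectors) for i in range(size)
        ]
    return centroids


def predict(centroids: dict[str, Embedding], vector: Embedding) -> str:
    """Inference step: pick the label whose average is closest."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vector))


# Data collection: a few labeled examples per object.
collected = {
    "book": [[0.9, 0.1], [0.8, 0.2]],
    "water_bottle": [[0.1, 0.9], [0.2, 0.8]],
}

model = train(collected)
print(predict(model, [0.15, 0.85]))  # prints "water_bottle"
```

Everything here runs in one process on one machine: the data, the trained model, and the prediction never go anywhere else.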
Discussion prompt: "How does training a model on your own stuff change how you think about AI? Does it feel more trustworthy? Less magical? More like a tool you control?"
Part 4 — Moving Offline: Run a Local LLM with Ollama (20 min + independent practice)
Introduction: Why Go Fully Offline? (5 min)
So far you've trained a vision model in the browser. But what about language models? Can you run an LLM without an internet connection?
Yes — and here's why you'd want to:
- Complete privacy: your conversations stay on your machine, period
- No cost: no API fees, no subscription
- Offline access: works anywhere, even without WiFi
Activity: Install Ollama (10 min)
Ollama is a free, open-source tool that makes it easy to download and run LLMs locally.
- Go to ollama.com and download the version for your operating system
- Install Ollama (standard install process for your OS)
- Open a terminal and run:
  `ollama run llama3.2`
  Note: Llama 3.2 is a 3B parameter model, small enough to run on most laptops but surprisingly capable. The download is a few GB, so make sure you're on WiFi.
- Wait for the model to download (this may take a few minutes)
- Once it's ready, you'll see a prompt. Type a message and see the response.
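Why is a 3B parameter model only a few GB? A rough estimate is the number of parameters times the storage used for each one. The Python sketch below works through that arithmetic. The 4-bit figure is an assumption about how compressed ("quantized") the download is, and the estimate ignores any extra files bundled with the model.

```python
def estimated_size_gb(parameters: float, bits_per_parameter: float) -> float:
    """Rough file size: parameters x bits each, converted to gigabytes."""
    total_bytes = parameters * bits_per_parameter / 8
    return total_bytes / 1e9


three_billion = 3e9

# 16 bits per parameter is a common unquantized precision for model weights.
print(f"16-bit: {estimated_size_gb(three_billion, 16):.1f} GB")

# 4 bits per parameter is an assumed, heavily compressed ("quantized") format.
print(f" 4-bit: {estimated_size_gb(three_billion, 4):.1f} GB")
```

The same arithmetic explains why much larger models are hard to run on a laptop: multiply the parameter count by ten or more and the file quickly outgrows typical memory.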
Activity: Take It Offline (5 min)
- Disconnect from WiFi — turn off your internet connection
- Type another message to the model
- Notice: it still works! The model is running entirely on your machine.
Try these prompts while offline:
- "Explain what data sovereignty means in simple terms."
- "I'm learning about AI. What's one thing you think every beginner should know?"
- "Write a haiku about running AI on a laptop."
Compare the experience:
| | Cloud AI (ChatGPT, Claude) | Local AI (Ollama) |
|---|---|---|
| Speed | Fast (runs on powerful servers) | Slower (runs on your laptop) |
| Privacy | Your data leaves your device | Your data never leaves |
| Cost | Free tier limited, paid tiers exist | Free (after download) |
| Model size | Massive (100B+ parameters) | Smaller (3B-70B parameters) |
| Internet needed | Yes | No |
| Guardrails | Built-in by company | Minimal or none, depending on the model |
| Customization | Limited to what the API allows | Full control |
Independent Practice: Explore the Local Model (30 min)
Try these challenges on your own after the lesson:
- Try a different model: Run `ollama run phi` or `ollama run mistral` and compare the responses
- Test the limits: What does this smaller model do well? Where does it struggle?
- Create a custom assistant: Ollama supports system prompts. Start a session with `ollama run llama3.2`, then type `/set system "You are a tutor who explains things like you're talking to a 10-year-old."` at the prompt
- Use Ollama from code: Visit the Ollama documentation to see how you can call the model from Python or JavaScript
Discussion prompt: "When would you choose a local model over ChatGPT? What would you give up? What would you gain?"
Check Your Understanding
- In your own words, explain the difference between cloud-based AI and local AI. Give one advantage of each.
- What does "data sovereignty" mean, and why does it matter when choosing an AI tool?
- Name one open-source model and one closed model. What's the main practical difference between them?
- What is a guardrail? Give an example of when one might be helpful and when one might be restrictive.
- Describe the three steps you followed in the My Room app (collect, train, infer). How does this mirror the way large AI models are built?
- When would you choose to use a local LLM over a cloud-based AI assistant? What trade-off are you making?
Take-Home Challenges
Assignment
- Install Ollama on your personal machine if you haven't already
- Try at least one additional model (e.g., `ollama run phi` or `ollama run mistral`)
- Write a short comparison: when would you use a local model vs. a cloud model? Bring your thoughts to Week 3
- Personalize the 'My Room' app with your own collections. Ask an LLM on Ollama about your items. Do you learn anything new?
Optional Supplemental Reading
- Read about data sovereignty and why it matters
- Explore the Ollama model library — what other models are available?
- Read about TensorFlow.js — the library that powers in-browser ML
- Learn about MobileNet — the architecture behind the My Room vision model
