Imagine a camera that can identify every car on a highway, detect a tumour in an X-ray in milliseconds, or recognise a defective part on a factory line — all without a human looking at each frame. That is Computer Vision. And with Ultralytics YOLO11, you can build systems like this in a few lines of Python.

This guide takes you from zero — installing Python — all the way to training your own YOLO model on a custom dataset. Every code block here is verified and runs. No skipping steps, no assuming you already know things.

✅ What you will build by the end of this guide: A working Python environment with Ultralytics installed, a Jupyter notebook that runs object detection, image classification, and instance segmentation — and the knowledge to train your own model on any custom dataset.

What Is Computer Vision?

Computer Vision (CV) is the field of AI that enables machines to interpret and understand visual information — images, videos, and live camera streams. Just as humans use their eyes and brain together to understand a scene, CV uses cameras and deep learning models to do the same, often faster and with greater consistency than a human expert.

Some things CV can do right now, in production:

  • Detect objects in video at 100+ frames per second (YOLO11n on GPU)
  • Identify plant diseases in drone footage over entire farms
  • Count people in a crowd from CCTV footage
  • Read license plates on a moving vehicle
  • Detect cancer in radiology scans with accuracy rivalling specialist doctors in some studies
  • Guide a robot arm to pick and place objects it has never seen before

Computer Vision Frameworks: Quick Overview

Before diving into Ultralytics, here is a quick map of the landscape so you know what exists and when to use each:

🔥 PyTorch (Research / Advanced)

Meta's deep learning framework. Most research papers implement in PyTorch. Maximum flexibility, but requires building your own training loop. Ultralytics itself is built on PyTorch.

🟠 TensorFlow / Keras (Production / Beginners)

Google's framework. Keras (now part of TensorFlow) offers a simpler API. Great for mobile and edge deployment. Larger corporate ecosystem.

👁️ OpenCV (Image Processing)

The classic computer vision library. Does not do deep learning training itself, but handles reading images, video streams, drawing boxes, colour conversion, and preprocessing. Used alongside YOLO.

Ultralytics YOLO (Best for Beginners + Production)

Built on PyTorch. Provides detection, classification, segmentation, pose, and tracking in one package — with a 5-line training API. One of the most widely adopted CV frameworks for practical, real-world applications.

🎯 Which should you learn first? Start with Ultralytics. It gives you real working results immediately, teaches you the concepts (datasets, YAML files, training, evaluation metrics), and the underlying framework (PyTorch) can be explored later when you need more control. This is the top-down learning approach — build something first, understand the internals second.

Why Ultralytics YOLO?

YOLO stands for You Only Look Once. Unlike older detection systems that scanned an image multiple times (sliding windows, or the region proposals of R-CNN), YOLO processes the entire image in a single forward pass through the network — making it fast enough to run in real time.

Ultralytics develops and maintains the modern YOLO family (YOLOv5 through YOLO11) and has built it into a complete, beginner-friendly platform. The numbers speak for themselves:

  • 129.8K+ GitHub stars
  • 254M+ downloads
  • 2.7B+ daily usages
  • 1K+ contributors

Trusted by Duolingo, Shell, Siemens, Renault, Philips, Intel and thousands of other companies. YOLO11 is their latest generation — faster, more accurate, and supports 5 vision tasks from a single install.

Complete Environment Setup

We will set up a clean, professional Python environment for this project. Follow every step exactly — no skipping.

Step 1: Install Python (if not already installed)

Open a terminal (Command Prompt on Windows, Terminal on Mac/Linux) and check if Python is installed:

bash
python --version
# or try:
python3 --version

If you see something like Python 3.10.12 or higher — you are good. If you get an error, install Python:

📥 Install Python 3.10 or higher: Go to python.org/downloads → Download the latest stable release → Run the installer.

⚠️ Windows users: During installation, tick the checkbox "Add Python to PATH" before clicking Install. This is the most common beginner mistake — if you miss it, Python commands won't work in the terminal.
Step 2: Create Your Project Folder

A clean folder for your Computer Vision work. Run these commands in your terminal:

bash
# Create the project directory
mkdir learning_computer_vision

# Enter the folder
cd learning_computer_vision
Step 3: Create a Virtual Environment

A virtual environment (venv) is an isolated Python installation for this project. It keeps your project's packages separate from other Python projects — a professional best practice.

bash
# Create the virtual environment (a folder named 'venv' will appear)
python -m venv venv
⚠️ If python -m venv fails: Try python3 -m venv venv. On some systems, python points to Python 2. Always use the version that returns 3.10+ in Step 1.
Step 4: Activate the Virtual Environment

Activating "switches" your terminal into the isolated environment. You must do this every time you open a new terminal for this project.

bash — Windows
venv\Scripts\activate
bash — Mac / Linux
source venv/bin/activate

After activating, your terminal prompt will show (venv) at the start — that is how you know it is active:

(venv) C:\Users\YourName\learning_computer_vision> _
Step 5: Install Ultralytics and Jupyter

With the venv active, install the required packages:

bash
# Install Ultralytics (includes PyTorch, OpenCV, and all dependencies)
pip install ultralytics

# Install Jupyter for interactive notebooks
pip install jupyter
⏱️ How long does this take? Depending on your internet speed, the full installation (including PyTorch) takes 3–10 minutes. Ultralytics automatically installs the correct version of PyTorch for your system. You will see many lines of output — this is normal.

Verify the installation worked:

bash
python -c "import ultralytics; ultralytics.checks()"

You should see output like this (verified on our machine):

Ultralytics 8.3.241 🚀 Python-3.11.14 torch-2.1.0 CPU (Apple M2)
Setup complete ✅ (8 CPUs, 8.0 GB RAM, 392.3/460.4 GB disk)
Step 6: Create Your Jupyter Notebook

Launch Jupyter in your project folder:

bash
jupyter notebook

Your browser will open at http://localhost:8888. Click New → Python 3 (ipykernel) to create a new notebook. Rename it to cv_yolo_demo.ipynb.

💡 Using VS Code instead? If you have VS Code installed, you can open your folder with code . and use the built-in Jupyter extension. It is the same experience without the browser. Install the "Jupyter" extension from the VS Code marketplace.

The 3 Core CV Tasks at a Glance

Ultralytics YOLO11 supports 5 vision tasks. Here we focus on the 3 most important ones for beginners:

📦 Object Detection

Draws a bounding box around each object and labels it. Tells you what is in the image and where it is.

Use for: surveillance, counting, locating objects

🏷️ Image Classification

Assigns a single label to the whole image. Tells you what is in the image — no location information.

Use for: quality control (pass/fail), medical categories

🎭 Instance Segmentation

Draws a pixel-level mask around each object — exact shape, not just a box. The most detailed output.

Use for: medical imaging, autonomous driving, fashion

A simple rule of thumb: if a bounding box is enough, use detection. If you only need one label per image, use classification. If you need the exact shape of each object, use segmentation.

Task 1: Object Detection

What it does

Object detection finds all instances of known objects in an image and draws a rectangular bounding box around each one, with a label and confidence score. For example, given a photo of a street, it might find: car (0.97), person (0.89), traffic light (0.82).

Running detection with a pretrained model (5 lines of code)

In your Jupyter notebook, create a new cell and type:

python — Cell 1: Verify setup
import ultralytics
ultralytics.checks()
# Expected: "Setup complete ✅"
python — Cell 2: Object Detection
from ultralytics import YOLO

# Load YOLO11 nano — pretrained on COCO (80 object classes)
# First run downloads the model (~6 MB) automatically
model = YOLO("yolo11n.pt")

# Run detection on a sample image
results = model("https://ultralytics.com/images/bus.jpg")

# Print what was detected
for r in results:
    for box in r.boxes:
        cls_name = model.names[int(box.cls[0])]
        conf = float(box.conf[0])
        print(f"Detected: {cls_name} ({conf:.0%} confidence)")

Verified output from our machine (Ultralytics 8.3.241):

image 1/1 bus.jpg: 640×480
Detected: bus (94% confidence)
Detected: person (89% confidence)
Detected: person (88% confidence)
Detected: person (86% confidence)
Detected: person (62% confidence)
python — Cell 3: Save result image
# Save image with boxes drawn on it
results[0].save("detection_result.jpg")
print("Saved! Open detection_result.jpg to see the boxes.")

# OR display it directly in the notebook
results[0].show()

Understanding the result

The results[0] object contains everything about the detection:

  • results[0].boxes — list of all detected bounding boxes
  • box.cls — class ID (integer). Use model.names[int(box.cls[0])] to get the name
  • box.conf — confidence score (0.0 to 1.0). 0.94 means 94% sure
  • box.xyxy — box coordinates as [x1, y1, x2, y2] in pixels
  • results[0].orig_shape — original image size (height, width)

Training YOLO detection on your own dataset

Using a pretrained model on your own data is called fine-tuning. You take the model that already knows about 80 classes from COCO, and teach it your specific classes — like "phone" and "laptop" on a desk, or "crack" and "healthy" on a wall.

Step A — Dataset folder structure

YOLO expects this exact structure:

my_dataset/
├── images/
│   ├── train/
│   │   ├── img001.jpg
│   │   ├── img002.jpg
│   │   └── ...          # your training images
│   └── val/
│       ├── img101.jpg
│       └── ...          # your validation images (~20% of total)
└── labels/
    ├── train/
    │   ├── img001.txt   # annotations for img001.jpg
    │   ├── img002.txt
    │   └── ...
    └── val/
        ├── img101.txt
        └── ...
📌 The naming rule: Each image and its label file must have the same name, just different extensions. images/train/cat001.jpg → labels/train/cat001.txt. YOLO finds the label by replacing images/ with labels/ in the path.
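This naming rule is easy to get wrong, so a quick sanity check before training pays off. Below is a minimal sketch in plain standard-library Python (no Ultralytics required); `missing_labels` is a hypothetical helper written for this guide, not part of any library API.

```python
import tempfile
from pathlib import Path

def missing_labels(dataset_root):
    """List images under images/<split>/ that have no matching labels/<split>/<name>.txt."""
    root = Path(dataset_root)
    missing = []
    for img in sorted(root.glob("images/*/*")):
        if img.suffix.lower() not in {".jpg", ".jpeg", ".png"}:
            continue
        # Mirror YOLO's convention: swap images/ for labels/, .jpg for .txt
        label = root / "labels" / img.parent.name / (img.stem + ".txt")
        if not label.exists():
            missing.append(str(img.relative_to(root)))
    return missing

# Demo on a throwaway mini-dataset: img002.jpg deliberately has no label file
with tempfile.TemporaryDirectory() as tmp:
    for sub in ("images/train", "labels/train"):
        (Path(tmp) / sub).mkdir(parents=True)
    (Path(tmp) / "images/train/img001.jpg").touch()
    (Path(tmp) / "images/train/img002.jpg").touch()
    (Path(tmp) / "labels/train/img001.txt").touch()
    print(missing_labels(tmp))  # only img002.jpg is flagged
```

Run this once on your real dataset folder; an empty list means every image has a label file (which, per the note above, may legitimately be empty).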

Step B — Label file format

Each .txt label file has one line per object. Each line:

# format: class_id center_x center_y width height
# all values are NORMALISED (0 to 1, relative to image size)
0 0.512 0.412 0.301 0.587
1 0.250 0.700 0.180 0.220
  • class_id — integer starting from 0 (0=cat, 1=dog, etc.)
  • center_x, center_y — centre of the box (0.5 = middle of image)
  • width, height — size of the box (1.0 = full image width/height)
  • If an image has no objects, the label file should exist but be empty
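To make the normalisation concrete, here is a small sketch that converts a box given in pixel corner coordinates into a YOLO label line. The function name `box_to_yolo_line` is just for illustration:

```python
def box_to_yolo_line(class_id, x1, y1, x2, y2, img_w, img_h):
    """Convert pixel corners (x1, y1, x2, y2) to a normalised YOLO label line."""
    cx = (x1 + x2) / 2 / img_w   # box centre, as a fraction of image width
    cy = (y1 + y2) / 2 / img_h   # box centre, as a fraction of image height
    w = (x2 - x1) / img_w        # box width, as a fraction of image width
    h = (y2 - y1) / img_h        # box height, as a fraction of image height
    return f"{class_id} {cx:.3f} {cy:.3f} {w:.3f} {h:.3f}"

# A 320x240 pixel box with top-left (160, 120) in a 640x480 image:
# its centre lands exactly mid-image, and it covers half of each dimension
print(box_to_yolo_line(0, 160, 120, 480, 360, 640, 480))  # 0 0.500 0.500 0.500 0.500
```

Annotation tools do this conversion for you, but knowing the arithmetic helps when a training run fails because coordinates fall outside the 0 to 1 range.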
💡 Don't annotate by hand! Use Roboflow — a free annotation tool. Upload your images, draw boxes, and export directly in YOLO format. It generates the labels/ folder and the YAML file automatically. This saves hours of work.

Step C — The data.yaml file

Place this file inside your dataset folder. It tells YOLO where the data is and what classes exist:

yaml — my_dataset/data.yaml
# Where your dataset lives (absolute or relative path)
path: my_dataset

# Paths to images (relative to 'path')
train: images/train
val: images/val
test: images/test  # optional

# Number of classes
nc: 2

# Class names — index 0 must match label id 0
names:
  0: cat
  1: dog

Step D — Train the model

python — Training Cell
from ultralytics import YOLO

# Start from a pretrained YOLO11 nano model
model = YOLO("yolo11n.pt")

# Train on your custom dataset
results = model.train(
    data="my_dataset/data.yaml",
    epochs=50,          # number of training passes over data
    imgsz=640,          # image size (resize all images to 640×640)
    batch=16,           # images processed at once (lower if RAM is limited)
    device="0",         # use "0" for GPU, "cpu" for CPU-only
    project="runs/train",
    name="my_experiment"
)

After training, your best model weights are saved at runs/train/my_experiment/weights/best.pt. Load and use it like this:

python — Using your trained model
best_model = YOLO("runs/train/my_experiment/weights/best.pt")
results = best_model("path/to/new_image.jpg")
results[0].show()

Task 2: Image Classification

What it does

Classification assigns a single label to the entire image. There are no boxes — the model simply answers "what is this image of?" with a probability for each class. Use it when you don't need to know where something is, only what it is.

Real examples: grading fruit quality (fresh vs rotten), classifying X-rays as normal or pneumonia, distinguishing between plant species.

python — Image Classification
from ultralytics import YOLO

# Load classification model (pretrained on ImageNet — 1000 classes)
model_cls = YOLO("yolo11n-cls.pt")

# Classify an image
results = model_cls("https://ultralytics.com/images/bus.jpg")

# Print top-5 predictions
for r in results:
    top5_idx = r.probs.top5           # list of top 5 class indices
    top5_conf = r.probs.top5conf       # their confidence scores
    print("Top 5 predictions:")
    for idx, conf in zip(top5_idx, top5_conf):
        print(f"  {r.names[idx]}: {float(conf):.1%}")

Classification dataset structure (different from detection!)

Classification uses a folder-per-class structure — the folder name is the label. No label files needed.

my_clf_dataset/
├── train/
│   ├── cat/              # folder name = class name
│   │   ├── cat001.jpg
│   │   ├── cat002.jpg
│   │   └── ...
│   └── dog/
│       ├── dog001.jpg
│       └── ...
└── val/
    ├── cat/
    │   └── cat_val001.jpg
    └── dog/
        └── dog_val001.jpg
✅ Classification is the simplest task to set up! Just create folders named after your classes and put the images inside. No labelling tool needed, no YAML file for basic setups. Ultralytics auto-discovers the classes from folder names.
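Because the folder name is the label, the class list can be recovered with a few lines of standard-library Python, roughly mirroring the auto-discovery described above (a sketch; `discover_classes` is a made-up helper, not an Ultralytics function):

```python
import tempfile
from pathlib import Path

def discover_classes(train_dir):
    """Class names are simply the sorted sub-folder names under train/."""
    return sorted(p.name for p in Path(train_dir).iterdir() if p.is_dir())

# Demo on a throwaway structure
with tempfile.TemporaryDirectory() as tmp:
    for cls in ("dog", "cat"):
        (Path(tmp) / "train" / cls).mkdir(parents=True)
    print(discover_classes(Path(tmp) / "train"))  # ['cat', 'dog']
```

A check like this is handy before training: if a typo created folders `cat` and `Cat`, they become two different classes.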

Training a classification model

python — Classification Training
from ultralytics import YOLO

model = YOLO("yolo11n-cls.pt")  # pretrained on ImageNet

results = model.train(
    data="my_clf_dataset/",  # path to folder with train/ and val/ subfolders
    epochs=30,
    imgsz=224,             # classification typically uses smaller images
    batch=32
)

Task 3: Instance Segmentation

What it does

Segmentation goes further than detection: instead of a rectangular box, it draws a pixel-perfect mask around each instance of an object. If you have two overlapping cars, you get two separate masks — one for each car. This is called instance segmentation (as opposed to semantic segmentation which doesn't distinguish instances).

This requires more detailed annotations and more compute, but gives much richer output — critical for robotics, medical imaging, and precise manufacturing inspection.

python — Segmentation with pretrained model
from ultralytics import YOLO

# YOLO11 nano segmentation model
model_seg = YOLO("yolo11n-seg.pt")

results = model_seg("https://ultralytics.com/images/bus.jpg")

for r in results:
    print(f"Objects with masks: {len(r.masks.data) if r.masks else 0}")
    for box in r.boxes:
        name = model_seg.names[int(box.cls[0])]
        conf = float(box.conf[0])
        print(f"  {name}: {conf:.0%}")

# Save with masks drawn
results[0].save("segmentation_result.jpg")

Segmentation dataset structure

Same folder structure as detection, but the label format is different — instead of 4 box coordinates, you provide a polygon traced around the object:

# Segmentation label format (one line per object)
# class_id x1 y1 x2 y2 x3 y3 x4 y4 ... (polygon points, normalised)
0 0.10 0.20 0.30 0.15 0.45 0.30 0.40 0.60 0.20 0.65 0.08 0.45
1 0.55 0.22 0.75 0.18 0.80 0.50 0.60 0.58
⚠️ Creating segmentation labels by hand is very tedious. Always use an annotation tool. Roboflow and CVAT both support polygon annotation and export to YOLO segmentation format automatically.
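One payoff of polygon labels is that you can compute an object's true area, which a bounding box only approximates. A quick sketch using the shoelace formula over normalised polygon points (pure Python, for illustration only):

```python
def polygon_area(points):
    """Shoelace formula: area of a polygon from [(x, y), ...] vertices.
    With normalised coordinates the result is a fraction of the image area."""
    area = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        area += x1 * y2 - x2 * y1
    return abs(area) / 2

# A square mask covering the centre quarter of the image
square = [(0.25, 0.25), (0.75, 0.25), (0.75, 0.75), (0.25, 0.75)]
print(polygon_area(square))  # 0.25
```

This is the kind of measurement (lesion size, crop coverage, part area) that makes segmentation worth the extra annotation effort.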

Real training example: Car Parts Segmentation

Here is a real training run from the official Ultralytics notebook on the Car Parts Segmentation dataset (23 classes, 3,156 training images). This is what your terminal will look like when training runs:

Downloading https://github.com/ultralytics/assets/.../yolo11n-seg.pt ... 6.0MB
YOLO11n-seg summary: 355 layers, 2,847,093 parameters, 10.4 GFLOPs
Transferred 510/561 items from pretrained weights

Epoch   GPU_mem   box_loss   seg_loss   cls_loss   dfl_loss   Instances   Size
 1/10     6.16G      1.319      2.775      4.069      1.475         110    640
 2/10     6.09G      1.066      1.904      2.492      1.232         106    640
 3/10      6.2G     0.9372      1.651      1.655      1.134          88    640
 5/10     6.18G     0.8182      1.407      1.205      1.057          99    640
 7/10     6.11G     0.7399      1.248      1.015      1.009          82    640
10/10     6.15G     0.6523      1.103     0.8649     0.9568          87    640

10 epochs completed in 0.229 hours.
Results saved to runs/segment/train
Final mAP50: 0.676 | Mask mAP50: 0.686 | mAP50-95: 0.561
📊 Understanding the training metrics:
box_loss / seg_loss / cls_loss — Loss values. Lower = better. Watch these decrease across epochs.
mAP50 — Mean Average Precision at 50% IoU overlap. Think of it as the model's overall accuracy score. 0.676 = 67.6% — good for 10 epochs on a complex dataset.
mAP50-95 — Stricter accuracy measure (averaged over multiple IoU thresholds). Always lower than mAP50.
Mask mAP50 — Same as mAP50 but evaluating the quality of the predicted masks (not just boxes).
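mAP is built on IoU (Intersection over Union): the overlap ratio between a predicted box and its ground-truth box. The computation is short enough to sketch in full, here for boxes in [x1, y1, x2, y2] pixel format:

```python
def iou(a, b):
    """Intersection over Union of two boxes in [x1, y1, x2, y2] format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection top-left
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])   # intersection bottom-right
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# A prediction shifted 50 px right of a 100x100 ground-truth box overlaps by half,
# but IoU is only 1/3: below the 0.5 threshold, so mAP50 scores it as a miss
print(iou([0, 0, 100, 100], [50, 0, 150, 100]))
```

So "50% IoU" in mAP50 is a stricter bar than it sounds: a box that covers half the object can still fail it.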

Training code for segmentation

python — Segmentation Training
from ultralytics import YOLO

# Start from pretrained YOLO11 nano segmentation model
model = YOLO("yolo11n-seg.pt")

results = model.train(
    data="my_dataset/data.yaml",  # same YAML as detection
    epochs=50,
    imgsz=640,
    batch=16,
    project="runs/segment",
    name="carparts_experiment"
)

# Run inference with the best trained model
best = YOLO("runs/segment/carparts_experiment/weights/best.pt")
results = best("test_car.jpg", save=True)

How to Get and Prepare Datasets

The biggest challenge for beginners is always: "Where do I get images and how do I label them?" Here are your best options:

🥇 Roboflow Universe

100,000+ public datasets, all downloadable in YOLO format. Also has a free annotation tool for your own images.

universe.roboflow.com →
🥈 Kaggle Datasets

Thousands of image datasets across every domain. Many are already in YOLO format or can be converted easily.

kaggle.com/datasets →
🥉 Ultralytics Datasets

Curated datasets officially maintained by Ultralytics — COCO, ImageNet, VOC, and specialty datasets. Download in one command.

docs.ultralytics.com/datasets →
📸 Your Own Camera

For custom applications, collect your own images with a phone. Even 200–300 annotated images can produce a useful model when fine-tuning from pretrained weights.

Annotate with Roboflow →

Minimum dataset sizes (rule of thumb)

  • Fine-tuning from pretrained: as few as 50–100 images per class can work
  • Good results: 500–1,000 images per class
  • State-of-the-art custom model: 5,000+ images per class
  • Always use an 80/20 train/validation split
  • Vary lighting, angles, distances in your training images — diversity matters more than quantity
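The 80/20 split above can be automated with a short standard-library script. A sketch, assuming your raw images sit in one flat folder; `split_dataset` is a hypothetical helper written for this guide, not an Ultralytics function:

```python
import random
import shutil
import tempfile
from pathlib import Path

def split_dataset(src_dir, dst_dir, val_fraction=0.2, seed=0):
    """Shuffle images from src_dir and copy them into dst_dir/images/{train,val}."""
    images = sorted(Path(src_dir).glob("*.jpg"))
    random.Random(seed).shuffle(images)            # fixed seed = reproducible split
    n_val = max(1, int(len(images) * val_fraction))
    for i, img in enumerate(images):
        split = "val" if i < n_val else "train"
        out = Path(dst_dir) / "images" / split
        out.mkdir(parents=True, exist_ok=True)
        shutil.copy(img, out / img.name)
    return n_val, len(images) - n_val

# Demo on 10 throwaway images: expect a 2/8 val/train split
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw"
    raw.mkdir()
    for i in range(10):
        (raw / f"img{i:03d}.jpg").touch()
    print(split_dataset(raw, Path(tmp) / "my_dataset"))  # (2, 8)
```

Shuffling before splitting matters: images shot in sequence are near-duplicates, and putting them all in one split hides overfitting.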

Which YOLO11 Model Size to Choose?

YOLO11 comes in 5 sizes — nano (n), small (s), medium (m), large (l), and extra-large (x). Each is a trade-off between speed and accuracy. Here is a practical guide:

| Model | Params | Speed | mAP50-95 | Best for |
|---|---|---|---|---|
| yolo11n (nano) | 2.6M | ⚡⚡⚡⚡⚡ fastest | 39.5 | Raspberry Pi, real-time on CPU, mobile apps, edge devices |
| yolo11s (small) | 9.4M | ⚡⚡⚡⚡ | 47.0 | Low-power devices needing better accuracy, Jetson Nano |
| yolo11m (medium) | 20.1M | ⚡⚡⚡ | 51.5 | Best starting point with a good GPU; production use, balanced |
| yolo11l (large) | 25.3M | ⚡⚡ | 53.4 | High-accuracy requirements, RTX 3080+ GPU available |
| yolo11x (extra-large) | 56.9M | ⚡ slowest | 54.7 | Maximum accuracy, research, powerful GPU (A100/H100) required |
✅ Quick decision guide:
• Running on a laptop CPU with no GPU → yolo11n
• Running on Google Colab (free T4 GPU) → yolo11s or yolo11m
• Training your first custom model → yolo11n (fast feedback)
• Deploying in a real product → yolo11m (best trade-off)
• Accuracy is the only thing that matters → yolo11x
⚠️ Always start small. When building a new model, always start with nano (yolo11n). It trains in minutes, so you can quickly verify your dataset is correct, your YAML is set up properly, and the training loop works. Once everything is confirmed, scale up to medium or large.

Exporting Your Trained Model

Once trained, you can export your model to 17+ formats for deployment. The most important ones:

python — Export to ONNX (universal format)
from ultralytics import YOLO

model = YOLO("runs/train/my_experiment/weights/best.pt")

# Export to ONNX — works everywhere, up to 3x faster on CPU
model.export(format="onnx")

# Export to TensorRT — up to 5x faster on NVIDIA GPU
model.export(format="engine")

# Export to TFLite — for Android/embedded devices
model.export(format="tflite")
| Format | Argument | Best for | Speed boost |
|---|---|---|---|
| PyTorch (.pt) | default | Development, fine-tuning | |
| ONNX (.onnx) | onnx | Universal deployment, CPU speedup | ~3x CPU |
| TensorRT (.engine) | engine | NVIDIA GPU production | ~5x GPU |
| CoreML (.mlpackage) | coreml | Apple iOS / macOS apps | Native |
| TFLite (.tflite) | tflite | Android, embedded Linux | Optimised |
| OpenVINO | openvino | Intel CPU / VPU devices | ~3x Intel |

Frequently Asked Questions

What is the difference between detection, classification, and segmentation?

Classification: one label for the whole image (no location). Detection: bounding boxes around each object + label. Segmentation: pixel-perfect outline (mask) around each object. Choose based on what your application actually needs — classification is simplest to set up, segmentation is most powerful but needs more data and compute.

Which YOLO model size should I start with?

Always start with yolo11n (nano) — it has 2.6M parameters and trains in minutes, so you can verify your dataset and setup quickly. Once everything works, upgrade to yolo11m (medium) for better accuracy. Only use large/xlarge if you have a powerful GPU and accuracy is critical.

Do I need a GPU to use YOLO?

For inference (using a trained model), no — CPU is fine for real-time detection with the nano model. For training, a GPU is strongly recommended. Use Google Colab for free T4 GPU access. Training yolo11n on a small dataset takes 10–30 minutes on a free Colab GPU.

Where do I get datasets for YOLO training?

Roboflow Universe is the best starting point — 100,000+ datasets downloadable directly in YOLO format. Kaggle also has thousands of CV datasets. For custom work, collect your own images and annotate with Roboflow's free annotation tool, which exports directly in YOLO format.

What does mAP50 mean in training results?

mAP50 (Mean Average Precision at 50% IoU) is the primary accuracy metric for object detection and segmentation. It measures how well your model finds objects and how accurately it boxes them. A score of 0.67 means 67% — good for a medium-sized dataset. Values above 0.8 are considered excellent. It increases as training progresses; if it plateaus, you may need more data or more epochs.

How many images do I need to train a YOLO model?

When fine-tuning from a pretrained model (which is always recommended), 50–100 images per class can produce a working model. For reliable production use, aim for 500+ images per class. Diversity (different angles, lighting, backgrounds) matters more than raw quantity. A dataset of 200 diverse images outperforms 1,000 near-identical ones.