Imagine a camera that can identify every car on a highway, detect a tumour in an X-ray in milliseconds, or recognise a defective part on a factory line — all without a human looking at each frame. That is Computer Vision. And with Ultralytics YOLO11, you can build systems like this in a few lines of Python.
This guide takes you from zero — installing Python — all the way to training your own YOLO model on a custom dataset. Every code block here is verified and runs. No skipping steps, no assuming you already know things.
- What Is Computer Vision?
- CV Frameworks: Quick Overview
- Why Ultralytics YOLO?
- Complete Environment Setup (Python → venv → install → notebook)
- The 3 Core CV Tasks at a Glance
- Task 1: Object Detection
- Task 2: Image Classification
- Task 3: Instance Segmentation
- How to Get & Prepare Datasets
- Which YOLO Model to Choose?
- Exporting Your Trained Model
- Frequently Asked Questions
What Is Computer Vision?
Computer Vision (CV) is the field of AI that enables machines to interpret and understand visual information — images, videos, and live camera streams. Just as humans use their eyes and brain together to understand a scene, CV uses cameras and deep learning models to do the same, often faster and with greater consistency than a human expert.
Some things CV can do right now, in production:
- Detect objects in video at 100+ frames per second (YOLO11n on GPU)
- Identify plant diseases in drone footage over entire farms
- Count people in a crowd from CCTV footage
- Read license plates on a moving vehicle
- Detect cancer in radiology scans with accuracy matching specialist doctors
- Guide a robot arm to pick and place objects it has never seen before
Computer Vision Frameworks: Quick Overview
Before diving into Ultralytics, here is a quick map of the landscape so you know what exists and when to use each:
🔥 PyTorch (Research / Advanced)
Meta's deep learning framework. Most research code is published in PyTorch. Maximum flexibility, but you write your own training loop. Ultralytics itself is built on PyTorch.
🟠 TensorFlow / Keras (Production / Beginners)
Google's framework. Keras (now part of TensorFlow) offers a simpler API. Great for mobile and edge deployment. Larger corporate ecosystem.
👁️ OpenCV (Image Processing)
The classic computer vision library. Does not do deep learning training itself, but handles reading images, video streams, drawing boxes, colour conversion, and preprocessing. Used alongside YOLO.
⚡ Ultralytics YOLO (Best for Beginners + Production)
Built on PyTorch. Provides detection, classification, segmentation, pose, and tracking in one package — with a 5-line training API. One of the most widely used CV frameworks for practical applications.
Why Ultralytics YOLO?
YOLO stands for You Only Look Once. Unlike older detection systems that scanned an image multiple times, YOLO processes the entire image in a single forward pass through the network — making it fast enough to run in real time.
Ultralytics develops and maintains YOLO and has built it into a complete, beginner-friendly platform, trusted by Duolingo, Shell, Siemens, Renault, Philips, Intel and thousands of other companies. YOLO11 is their latest generation: faster, more accurate, and supporting 5 vision tasks from a single install.
Complete Environment Setup
We will set up a clean, professional Python environment for this project. Follow every step exactly — no skipping.
Install Python (if not already installed)
Open a terminal (Command Prompt on Windows, Terminal on Mac/Linux) and check if Python is installed:
```bash
python --version
# or try:
python3 --version
```
If you see something like Python 3.10.12 or higher — you are good. If you get an error, install Python 3.10 or newer from python.org.
⚠️ Windows users: During installation, tick the checkbox "Add Python to PATH" before clicking Install. This is the most common beginner mistake — if you miss it, Python commands won't work in the terminal.
Create Your Project Folder
A clean folder for your Computer Vision work. Run these commands in your terminal:
```bash
# Create the project directory
mkdir learning_computer_vision

# Enter the folder
cd learning_computer_vision
```
Create a Virtual Environment
A virtual environment (venv) is an isolated Python installation for this project. It keeps your project's packages separate from other Python projects — a professional best practice.
```bash
# Create the virtual environment (a folder named 'venv' will appear)
python -m venv venv
```
Note: if python failed in Step 1 but python3 worked, run python3 -m venv venv instead. On some systems, python still points to Python 2. Always use the version that returned 3.10+ in Step 1.
Activate the Virtual Environment
Activating "switches" your terminal into the isolated environment. You must do this every time you open a new terminal for this project.
```bash
# Windows
venv\Scripts\activate

# Mac / Linux
source venv/bin/activate
```
After activating, your terminal prompt will show (venv) at the start — that is how you know it is active.
Install Ultralytics and Jupyter
With the venv active, install the required packages:
```bash
# Install Ultralytics (includes PyTorch, OpenCV, and all dependencies)
pip install ultralytics

# Install Jupyter for interactive notebooks
pip install jupyter
```
Verify the installation worked:
```bash
python -c "import ultralytics; ultralytics.checks()"
```
You should see a short report listing the Ultralytics version, your Python version, and the available hardware (CPU or GPU), ending with a setup-complete checkmark.
Create Your Jupyter Notebook
Launch Jupyter in your project folder:
```bash
jupyter notebook
```
Your browser will open at http://localhost:8888. Click New → Python 3 (ipykernel) to create a new notebook. Rename it to cv_yolo_demo.ipynb.
Prefer VS Code? Open your project folder with code . and use the built-in Jupyter extension. It is the same experience without the browser. Install the "Jupyter" extension from the VS Code marketplace.
The 3 Core CV Tasks at a Glance
Ultralytics YOLO11 supports 5 vision tasks. Here we focus on the 3 most important ones for beginners:
Object Detection
Draws a bounding box around each object and labels it. Tells you what is in the image and where it is.
Use for: surveillance, counting, locating objects
Image Classification
Assigns a single label to the whole image. Tells you what is in the image — no location information.
Use for: quality control (pass/fail), medical categories
Instance Segmentation
Draws a pixel-level mask around each object — exact shape, not just a box. The most detailed output.
Use for: medical imaging, autonomous driving, fashion
A simple rule of thumb: if a bounding box is enough, use detection. If you only need one label per image, use classification. If you need the exact shape of each object, use segmentation.
Task 1: Object Detection
What it does
Object detection finds all instances of known objects in an image and draws a rectangular bounding box around each one, with a label and confidence score. For example, given a photo of a street, it might find: car (0.97), person (0.89), traffic light (0.82).
Running detection with a pretrained model (5 lines of code)
In your Jupyter notebook, create a new cell and type:
```python
# Cell 1: Verify setup
import ultralytics
ultralytics.checks()
# Expected: "Setup complete ✅"
```
```python
# Cell 2: Object Detection
from ultralytics import YOLO

# Load YOLO11 nano — pretrained on COCO (80 object classes)
# First run downloads the model (~6 MB) automatically
model = YOLO("yolo11n.pt")

# Run detection on a sample image
results = model("https://ultralytics.com/images/bus.jpg")

# Print what was detected
for r in results:
    for box in r.boxes:
        cls_name = model.names[int(box.cls[0])]
        conf = float(box.conf[0])
        print(f"Detected: {cls_name} ({conf:.0%} confidence)")
```
On the sample bus image you should see a handful of detections, typically one bus and several people (verified on Ultralytics 8.3.241).
```python
# Save image with boxes drawn on it
results[0].save("detection_result.jpg")
print("Saved! Open detection_result.jpg to see the boxes.")

# OR display it directly in the notebook
results[0].show()
```
Understanding the result
The results[0] object contains everything about the detection:
- results[0].boxes — list of all detected bounding boxes
- box.cls — class ID (integer). Use model.names[int(box.cls[0])] to get the name
- box.conf — confidence score (0.0 to 1.0); 0.94 means 94% sure
- box.xyxy — box coordinates as [x1, y1, x2, y2] in pixels
- results[0].orig_shape — original image size (height, width)
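To make these attributes concrete, here is a framework-independent sketch of typical post-processing: filtering raw detections by a confidence threshold and resolving class IDs to names. The IDs, scores, and box coordinates below are made-up sample values, not real model output.

```python
# Post-processing sketch, independent of YOLO: keep only confident
# detections and map class IDs to readable names.
# All values below are made-up samples, not real model output.
names = {0: "person", 5: "bus"}  # a subset of the 80 COCO class names

detections = [
    (5, 0.94, (12, 230, 800, 730)),   # (class_id, confidence, (x1, y1, x2, y2))
    (0, 0.89, (48, 400, 240, 900)),
    (0, 0.27, (670, 380, 810, 880)),  # low confidence, likely a false positive
]

def filter_detections(dets, threshold=0.5):
    """Keep detections at or above the threshold, with readable names."""
    return [(names[c], conf, box) for c, conf, box in dets if conf >= threshold]

for name, conf, box in filter_detections(detections):
    print(f"{name}: {conf:.0%} at {box}")
```

This is the same loop pattern as Cell 2 above, just decoupled from the model so you can see the logic on its own.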
Training YOLO detection on your own dataset
Using a pretrained model on your own data is called fine-tuning. You take the model that already knows about 80 classes from COCO, and teach it your specific classes — like "phone" and "laptop" on a desk, or "crack" and "healthy" on a wall.
Step A — Dataset folder structure
YOLO expects a fixed layout: an images/ folder and a parallel labels/ folder, each with train/ and val/ subfolders. Every image has a matching label file with the same name — images/train/cat001.jpg → labels/train/cat001.txt. YOLO finds the label by replacing images/ with labels/ in the path.
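As a sketch, this layout can also be created programmatically with Python's pathlib; the folder name my_dataset matches the data.yaml example in Step C.

```python
from pathlib import Path

# Create the folder layout YOLO expects for a detection dataset:
# parallel images/ and labels/ trees, each with train/ and val/ splits,
# plus a data.yaml at the dataset root (filled in as shown in Step C).
root = Path("my_dataset")
for kind in ("images", "labels"):
    for split in ("train", "val"):
        (root / kind / split).mkdir(parents=True, exist_ok=True)

(root / "data.yaml").touch()
print(f"Created {sum(1 for _ in root.rglob('*'))} entries")  # 7 entries
```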
Step B — Label file format
Each .txt label file has one line per object, in the format class_id center_x center_y width height, with all coordinates normalised to the 0–1 range:
- class_id — integer starting from 0 (0=cat, 1=dog, etc.)
- center_x, center_y — centre of the box (0.5 = middle of image)
- width, height — size of the box (1.0 = full image width/height)
- If an image has no objects, the label file should exist but be empty
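To see the arithmetic, here is a small sketch that parses one label line and converts the normalised, centre-based box back to pixel corners. The label values and image size are illustrative.

```python
# Sketch: parse one YOLO label line and convert its normalised,
# centre-based box to pixel corner coordinates.
# The label values and image size below are illustrative.
label_line = "0 0.50 0.40 0.30 0.20"   # class_id cx cy w h (all 0-1)
img_w, img_h = 640, 480                # image size in pixels

parts = label_line.split()
cls_id = int(parts[0])
cx, cy, w, h = map(float, parts[1:])

# centre/size (normalised) -> corners (pixels)
x1 = (cx - w / 2) * img_w
y1 = (cy - h / 2) * img_h
x2 = (cx + w / 2) * img_w
y2 = (cy + h / 2) * img_h

print(cls_id, (x1, y1, x2, y2))  # roughly (224, 144, 416, 240)
```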
Tip: annotation tools like Roboflow generate the labels/ folder and the YAML file automatically. This saves hours of work.
Step C — The data.yaml file
Place this file inside your dataset folder. It tells YOLO where the data is and what classes exist:
```yaml
# my_dataset/data.yaml
# Where your dataset lives (absolute or relative path)
path: my_dataset

# Paths to images (relative to 'path')
train: images/train
val: images/val
test: images/test  # optional

# Number of classes
nc: 2

# Class names — index 0 must match label id 0
names:
  0: cat
  1: dog
```
Step D — Train the model
```python
# Training cell
from ultralytics import YOLO

# Start from a pretrained YOLO11 nano model
model = YOLO("yolo11n.pt")

# Train on your custom dataset
results = model.train(
    data="my_dataset/data.yaml",
    epochs=50,      # number of training passes over the data
    imgsz=640,      # image size (resize all images to 640×640)
    batch=16,       # images processed at once (lower if RAM is limited)
    device="0",     # use "0" for GPU, "cpu" for CPU-only
    project="runs/train",
    name="my_experiment"
)
```
After training, your best model weights are saved at runs/train/my_experiment/weights/best.pt. Load and use it like this:
```python
from ultralytics import YOLO

best_model = YOLO("runs/train/my_experiment/weights/best.pt")
results = best_model("path/to/new_image.jpg")
results[0].show()
```
Task 2: Image Classification
What it does
Classification assigns a single label to the entire image. There are no boxes — the model simply answers "what is this image of?" with a probability for each class. Use it when you don't need to know where something is, only what it is.
Real examples: grading fruit quality (fresh vs rotten), classifying X-rays as normal or pneumonia, distinguishing between plant species.
```python
# Image Classification
from ultralytics import YOLO

# Load classification model (pretrained on ImageNet — 1000 classes)
model_cls = YOLO("yolo11n-cls.pt")

# Classify an image
results = model_cls("https://ultralytics.com/images/bus.jpg")

# Print top-5 predictions
for r in results:
    top5_idx = r.probs.top5        # list of top 5 class indices
    top5_conf = r.probs.top5conf   # their confidence scores
    print("Top 5 predictions:")
    for idx, conf in zip(top5_idx, top5_conf):
        print(f"  {r.names[idx]}: {float(conf):.1%}")
```
Classification dataset structure (different from detection!)
Classification uses a folder-per-class structure — the folder name is the label. No label files needed.
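As an illustrative sketch (the my_clf_dataset name matches the training example below; the class and file names are made up):

```
my_clf_dataset/
├── train/
│   ├── cat/          ← every image in here is labelled "cat"
│   │   ├── img001.jpg
│   │   └── img002.jpg
│   └── dog/
│       └── img001.jpg
└── val/
    ├── cat/
    └── dog/
```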
Training a classification model
```python
# Classification training
from ultralytics import YOLO

model = YOLO("yolo11n-cls.pt")  # pretrained on ImageNet

results = model.train(
    data="my_clf_dataset/",  # path to folder with train/ and val/ subfolders
    epochs=30,
    imgsz=224,               # classification typically uses smaller images
    batch=32
)
```
Task 3: Instance Segmentation
What it does
Segmentation goes further than detection: instead of a rectangular box, it draws a pixel-perfect mask around each instance of an object. If you have two overlapping cars, you get two separate masks — one for each car. This is called instance segmentation (as opposed to semantic segmentation which doesn't distinguish instances).
This requires more detailed annotations and more compute, but gives much richer output — critical for robotics, medical imaging, and precise manufacturing inspection.
```python
# Segmentation with a pretrained model
from ultralytics import YOLO

# YOLO11 nano segmentation model
model_seg = YOLO("yolo11n-seg.pt")

results = model_seg("https://ultralytics.com/images/bus.jpg")

for r in results:
    print(f"Objects with masks: {len(r.masks.data) if r.masks else 0}")
    for box in r.boxes:
        name = model_seg.names[int(box.cls[0])]
        conf = float(box.conf[0])
        print(f"  {name}: {conf:.0%}")

# Save with masks drawn
results[0].save("segmentation_result.jpg")
```
Segmentation dataset structure
Same folder structure as detection, but the label format is different — instead of 4 box coordinates, each line lists the class id followed by a polygon of normalised x y points traced around the object.
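For illustration, a single segmentation label line might look like this (values made up; each x y pair is one polygon vertex, normalised to 0–1):

```
0 0.42 0.20 0.61 0.23 0.66 0.49 0.40 0.51
```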
Real training example: Car Parts Segmentation
Here is a real training run from the official Ultralytics notebook on the Car Parts Segmentation dataset (23 classes, 3,156 training images). During training, the terminal prints per-epoch loss values and validation metrics; the key numbers to watch are:
• box_loss / seg_loss / cls_loss — Loss values. Lower = better. Watch these decrease across epochs.
• mAP50 — Mean Average Precision at 50% IoU overlap. Think of it as the model's overall accuracy score. 0.676 = 67.6% — good for 10 epochs on a complex dataset.
• mAP50-95 — Stricter accuracy measure (averaged over multiple IoU thresholds). Always lower than mAP50.
• Mask mAP50 — Same as mAP50 but evaluating the quality of the predicted masks (not just boxes).
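IoU (Intersection over Union), the overlap measure behind these metrics, is simple enough to compute by hand. Here is a minimal sketch for corner-format boxes:

```python
# Sketch: IoU (Intersection over Union), the overlap measure behind
# mAP50. Boxes are (x1, y1, x2, y2) corner coordinates in pixels.
def iou(a, b):
    # Overlapping rectangle (empty if the boxes do not intersect)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    # Union = both areas minus the double-counted intersection
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))  # 1.0 (identical boxes)
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.3333... (partial overlap)
```

At mAP50, a prediction counts as correct when its IoU with a ground-truth box is at least 0.5.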
Training code for segmentation
```python
# Segmentation training
from ultralytics import YOLO

# Start from the pretrained YOLO11 nano segmentation model
model = YOLO("yolo11n-seg.pt")

results = model.train(
    data="my_dataset/data.yaml",  # same YAML format as detection
    epochs=50,
    imgsz=640,
    batch=16,
    project="runs/segment",
    name="carparts_experiment"
)

# Run inference with the best trained model
best = YOLO("runs/segment/carparts_experiment/weights/best.pt")
results = best("test_car.jpg", save=True)
```
How to Get and Prepare Datasets
The biggest challenge for beginners is always: "Where do I get images and how do I label them?" Here are your best options:
🥇 Roboflow Universe
100,000+ public datasets, all downloadable in YOLO format. Also has a free annotation tool for your own images.
universe.roboflow.com →

🥈 Kaggle Datasets
Thousands of image datasets across every domain. Many are already in YOLO format or can be converted easily.
kaggle.com/datasets →

🥉 Ultralytics Datasets
Curated datasets officially maintained by Ultralytics — COCO, ImageNet, VOC, and specialty datasets. Download in one command.
docs.ultralytics.com/datasets →

📸 Your Own Camera
For custom applications, collect your own images with a phone. Even 200–300 annotated images can produce a useful model when fine-tuning from pretrained weights.
Annotate with Roboflow →

Minimum dataset sizes (rule of thumb)
- Fine-tuning from pretrained: as few as 50–100 images per class can work
- Good results: 500–1,000 images per class
- State-of-the-art custom model: 5,000+ images per class
- Always use an 80/20 train/validation split
- Vary lighting, angles, distances in your training images — diversity matters more than quantity
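A short script is a simple way to produce the 80/20 split. This sketch assumes a flat folder of .jpg files; the source and destination folder names are illustrative, so adapt the paths to your project.

```python
import random
import shutil
from pathlib import Path

def split_dataset(src="all_images", dst="my_dataset/images", ratio=0.8, seed=42):
    """Copy a flat folder of .jpg files into dst/train and dst/val."""
    files = sorted(Path(src).glob("*.jpg"))
    random.Random(seed).shuffle(files)   # seeded shuffle for reproducibility
    cut = int(len(files) * ratio)        # first 80% -> train, the rest -> val
    for split, subset in (("train", files[:cut]), ("val", files[cut:])):
        out = Path(dst) / split
        out.mkdir(parents=True, exist_ok=True)
        for f in subset:
            shutil.copy(f, out / f.name)
    return cut, len(files) - cut

# Example: split_dataset("all_images", "my_dataset/images")
```

For detection datasets, run the same split over the matching label .txt files so each image keeps its label in the parallel folder.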
Which YOLO11 Model Size to Choose?
YOLO11 comes in 5 sizes — nano (n), small (s), medium (m), large (l), and extra-large (x). Each is a trade-off between speed and accuracy. Here is a practical guide:
| Model | Params | Speed | mAP50-95 | Best for |
|---|---|---|---|---|
| yolo11n nano | 2.6M | ⚡⚡⚡⚡⚡ Fastest | 39.5 | Raspberry Pi, real-time on CPU, mobile apps, edge devices |
| yolo11s small | 9.4M | ⚡⚡⚡⚡ | 47.0 | Low-power devices needing better accuracy, Jetson Nano |
| yolo11m medium | 20.1M | ⚡⚡⚡ | 51.5 | ✅ Best starting point. Good GPU, production use, balanced |
| yolo11l large | 25.3M | ⚡⚡ | 53.4 | High-accuracy requirements, RTX 3080+ GPU available |
| yolo11x extra-large | 56.9M | ⚡ | 54.7 | Maximum accuracy, research, powerful GPU (A100/H100) required |
• Running on a laptop CPU with no GPU → yolo11n
• Running on Google Colab (free T4 GPU) → yolo11s or yolo11m
• Training your first custom model → yolo11n (fast feedback)
• Deploying in a real product → yolo11m (best trade-off)
• Accuracy is the only thing that matters → yolo11x
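The quick picks above can be summarised as a small helper. The weight file names are real YOLO11 checkpoints; the decision rules simply encode this section's rules of thumb.

```python
# Rule-of-thumb model picker, encoding this section's quick picks.
def choose_model(has_gpu: bool, first_experiment: bool = False,
                 max_accuracy: bool = False) -> str:
    if max_accuracy:
        return "yolo11x.pt"   # maximum accuracy, needs a powerful GPU
    if first_experiment or not has_gpu:
        return "yolo11n.pt"   # fast feedback, real-time even on CPU
    return "yolo11m.pt"       # best speed/accuracy trade-off in production

print(choose_model(has_gpu=False))                    # yolo11n.pt
print(choose_model(has_gpu=True))                     # yolo11m.pt
print(choose_model(has_gpu=True, max_accuracy=True))  # yolo11x.pt
```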
Exporting Your Trained Model
Once trained, you can export your model to 17+ formats for deployment. The most important ones:
```python
# Export to deployment formats
from ultralytics import YOLO

model = YOLO("runs/train/my_experiment/weights/best.pt")

# Export to ONNX — works everywhere, up to 3x faster on CPU
model.export(format="onnx")

# Export to TensorRT — up to 5x faster on NVIDIA GPU
model.export(format="engine")

# Export to TFLite — for Android/embedded devices
model.export(format="tflite")
```
| Format | Argument | Best for | Speed boost |
|---|---|---|---|
| PyTorch (.pt) | default | Development, fine-tuning | — |
| ONNX (.onnx) | onnx | Universal deployment, CPU speedup | ~3x CPU |
| TensorRT (.engine) | engine | NVIDIA GPU production | ~5x GPU |
| CoreML (.mlpackage) | coreml | Apple iOS / macOS apps | Native |
| TFLite (.tflite) | tflite | Android, embedded Linux | Optimised |
| OpenVINO | openvino | Intel CPU / VPU devices | ~3x Intel |
What to Learn Next
- Ultralytics official documentation — full API reference, all arguments
- Pose Estimation — detect human body keypoints (joints)
- Oriented Bounding Boxes (OBB) — rotated boxes for aerial imagery
- Ultralytics official notebooks — dozens of real working examples on Colab
- Roboflow Universe — find and annotate datasets for your project idea
- Google Colab — free GPU for training (no local GPU needed)
Frequently Asked Questions
What is the difference between classification, detection, and segmentation?
Classification: one label for the whole image (no location). Detection: bounding boxes around each object + label. Segmentation: pixel-perfect outline (mask) around each object. Choose based on what your application actually needs — classification is simplest to set up, segmentation is most powerful but needs more data and compute.
Which YOLO11 model size should a beginner start with?
Always start with yolo11n (nano) — it has 2.6M parameters and trains in minutes, so you can verify your dataset and setup quickly. Once everything works, upgrade to yolo11m (medium) for better accuracy. Only use large/xlarge if you have a powerful GPU and accuracy is critical.
Do I need a GPU?
For inference (using a trained model), no — CPU is fine for real-time detection with the nano model. For training, a GPU is strongly recommended. Use Google Colab for free T4 GPU access. Training yolo11n on a small dataset takes 10–30 minutes on a free Colab GPU.
Where can I find datasets for my project?
Roboflow Universe is the best starting point — 100,000+ datasets downloadable directly in YOLO format. Kaggle also has thousands of CV datasets. For custom work, collect your own images and annotate with Roboflow's free annotation tool, which exports directly in YOLO format.
What is mAP50 and what is a good score?
mAP50 (Mean Average Precision at 50% IoU) is the primary accuracy metric for object detection and segmentation. It measures how well your model finds objects and how accurately it boxes them. A score of 0.67 means 67% — good for a medium-sized dataset. Values above 0.8 are considered excellent. It increases as training progresses; if it plateaus, you may need more data or more epochs.
How many images do I need to train a custom model?
When fine-tuning from a pretrained model (which is always recommended), 50–100 images per class can produce a working model. For reliable production use, aim for 500+ images per class. Diversity (different angles, lighting, backgrounds) matters more than raw quantity. A dataset of 200 diverse images outperforms 1,000 near-identical ones.