Traditional CNNs can be used to detect objects in images. To detect more than one object, however, the CNN has to be run repeatedly. This is the main reason why such CNNs cannot do real-time object detection.

YOLO (You Only Look Once) is a real-time object detection system that detects all objects in an image in a single run. It can process images at 40-90 FPS on a Titan X GPU, which is fast enough for real-time video recognition.

Basic idea

Prior detection systems repurpose classifiers or localizers to perform detection. They apply the model to an image at multiple locations and scales. High-scoring regions of the image are considered detections.
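
To make the cost of this approach concrete, here is a minimal sketch of sliding-window detection in plain Python. It is an illustration only, not the code of any particular detector: the function names, the window sizes, the stride and the toy scoring function (which stands in for a real classifier) are all assumptions made for this example.

```python
def sliding_windows(img_w, img_h, win, stride):
    """Yield (x, y, w, h) for every square window of side `win` that fits."""
    for y in range(0, img_h - win + 1, stride):
        for x in range(0, img_w - win + 1, stride):
            yield (x, y, win, win)


def detect(img_w, img_h, score_fn, window_sizes, stride, threshold):
    """Score every window at every scale; keep the high-scoring ones."""
    detections = []
    evaluations = 0  # how many times the "classifier" had to run
    for win in window_sizes:
        for box in sliding_windows(img_w, img_h, win, stride):
            evaluations += 1
            score = score_fn(box)
            if score >= threshold:
                detections.append((box, score))
    return detections, evaluations


def toy_score(box, target=(96, 96, 64, 64)):
    """Hypothetical stand-in for a classifier: overlap (IoU) with a made-up object."""
    x, y, w, h = box
    tx, ty, tw, th = target
    inter_w = max(0, min(x + w, tx + tw) - max(x, tx))
    inter_h = max(0, min(y + h, ty + th) - max(y, ty))
    inter = inter_w * inter_h
    union = w * h + tw * th - inter
    return inter / union


detections, evaluations = detect(
    256, 256, toy_score, window_sizes=(64, 128), stride=32, threshold=0.9
)
print(evaluations, detections)
```

Even on this tiny 256x256 example with only two window sizes, the scoring function runs 74 times to find a single object. With a real CNN as the scorer, finer strides and more scales, the number of evaluations per image grows quickly.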

YOLO takes a completely different approach. It applies a single neural network to the full image. This network divides the image into regions and predicts bounding boxes and probabilities for each region. These bounding boxes are weighted by the predicted probabilities.
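
The following sketch shows one way the output of such a network could be turned into scored boxes. It assumes a simplified layout that is not taken from this text: the image is split into a grid_size x grid_size grid, each cell predicts exactly one box (center as an offset inside the cell, width and height as fractions of the whole image), a confidence value and per-class probabilities. The dictionary keys and the function name are hypothetical, and the network itself is replaced by hand-written numbers.

```python
def decode_grid(predictions, grid_size, img_w, img_h, threshold):
    """Convert per-cell predictions into scored boxes in pixel coordinates.

    predictions[row][col] is a dict with (assumed) keys:
      "box":         (cx, cy, w, h), cx/cy in [0, 1] relative to the cell,
                     w/h in [0, 1] relative to the whole image
      "confidence":  how likely the box contains any object
      "class_probs": dict of class name -> probability given an object
    """
    cell_w = img_w / grid_size
    cell_h = img_h / grid_size
    results = []
    for row in range(grid_size):
        for col in range(grid_size):
            cell = predictions[row][col]
            cx, cy, w, h = cell["box"]
            # Box center and size in pixels
            center_x = (col + cx) * cell_w
            center_y = (row + cy) * cell_h
            box_w = w * img_w
            box_h = h * img_h
            # Weight the box by the predicted probabilities
            label, prob = max(cell["class_probs"].items(), key=lambda kv: kv[1])
            score = cell["confidence"] * prob
            if score >= threshold:
                left = center_x - box_w / 2
                top = center_y - box_h / 2
                results.append(
                    {"label": label, "score": score, "box": (left, top, box_w, box_h)}
                )
    return results


# Hand-written stand-in for one network output on a 100x100 image, 2x2 grid
empty = {"box": (0.5, 0.5, 0.0, 0.0), "confidence": 0.0,
         "class_probs": {"dog": 0.5, "cat": 0.5}}
dog = {"box": (0.5, 0.5, 0.5, 0.5), "confidence": 0.9,
       "class_probs": {"dog": 0.8, "cat": 0.2}}
predictions = [[empty, empty],
               [dog, empty]]

boxes = decode_grid(predictions, grid_size=2, img_w=100, img_h=100, threshold=0.5)
print(boxes)
```

The important point is that all cells are decoded from a single network output, so the number of objects in the image does not change how often the network has to run. A real detector would additionally merge overlapping boxes, which is left out here.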

YOLO has several advantages over classifier-based systems. It looks at the whole image at test time, so its predictions are informed by global context in the image. It also makes predictions with a single network evaluation, unlike systems such as R-CNN, which require thousands of evaluations for a single image. This makes it extremely fast: more than 1000x faster than R-CNN and 100x faster than Fast R-CNN.