In recent years, action recognition in video analytics has attracted substantial research attention. However, most existing techniques either analyze a video as a whole or classify every frame before assigning a single action label to the entire video. By contrast, human vision often requires only a single glimpse of visual information to recognize a scene. It turns out that a small set of frames, or even a single frame, is often sufficient for accurate recognition. In this study, we propose a method for near real-time detection, localization, and recognition of events of interest in frames sampled from a continuous video stream, such as the feed from a security camera. The model accepts input frames at fixed time intervals and can assign an action label from a single frame. We estimate the action label for the video stream by combining per-frame predictions over a given time window. We demonstrate that the YOLO approach is efficient and reasonably fast for localization and recognition on the Human Activities dataset.
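The abstract states that per-frame predictions are combined over a time window to produce a stream-level action label, but does not specify the combination rule. A minimal sketch, assuming a simple majority vote over the labels predicted for the frames in one window (the function name and voting rule are illustrative assumptions, not the paper's stated method):

```python
from collections import Counter
from typing import List

def aggregate_action_labels(frame_labels: List[str]) -> str:
    """Combine per-frame action predictions from one time window
    into a single stream-level action label.

    Assumption: a majority vote is used here for illustration;
    the paper only says per-frame results are combined over time.
    """
    if not frame_labels:
        raise ValueError("window contains no frame predictions")
    # most_common(1) returns [(label, count)] for the most frequent label
    return Counter(frame_labels).most_common(1)[0][0]

# Example: labels predicted for frames sampled within one window
window = ["walking", "walking", "running", "walking"]
print(aggregate_action_labels(window))  # -> walking
```

Other aggregation schemes (e.g. averaging per-class confidence scores across frames) would fit the same interface.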