People are at the very heart of our daily work and life. As we strive to leverage artificial intelligence to empower every person on the planet to achieve more, we need to understand people far better than we can today. Human-computer interaction plays a significant role in human-machine hybrid intelligence, and human understanding is a critical step toward addressing the tremendous challenges of video understanding. In this paper, we share our views on why and how a human-centric approach can address challenging video understanding problems. We review human-centric vision tasks and their current status, highlight the remaining challenges, and discuss how insights into human brain function can be leveraged to address some of them. We show that semantic models, view-invariant models, and spatial-temporal visual attention mechanisms are important building blocks. We also discuss future perspectives on video understanding.