A distributed UAV (unmanned aerial vehicle) flocking control method based on visual geometry is proposed, in which only monocular RGB images are used to estimate the relative positions and velocities between drones. The method relies on no special visual markers or external infrastructure, and it requires neither inter-UAV communication nor prior knowledge of UAV size. Combining the strengths of deep learning and classical geometry, it uses a deep optical-flow network to estimate dense point matches between two consecutive images, applies segmentation to classify the matched points as background or as belonging to a specific UAV, and then lifts the classified matches into Euclidean space using depth-map information. From each class of the resulting 3D point matches (3D feature point pairs), the rotation matrix, translation vector, and velocity of the corresponding UAV, together with the relative positions between drones, are estimated using RANSAC and least squares. A flocking control model is then built on these estimates. Experimental results in the Microsoft AirSim simulation environment show that, on all evaluation metrics, the proposed method achieves nearly the same performance as a UAV flocking algorithm driven by ground-truth swarm states.
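To make the geometric core concrete, the sketch below shows one standard way to fit a rigid transform (R, t) to a class of 3D feature point pairs with RANSAC and least squares, as the abstract describes: a Kabsch/Umeyama closed-form fit inside a RANSAC loop. This is a minimal illustration under stated assumptions, not the paper's implementation; the function names, iteration count, and inlier threshold are hypothetical.

```python
import numpy as np

def fit_rigid_transform(P, Q):
    """Least-squares rigid fit (Kabsch/Umeyama): find R, t with Q ~= R @ P + t.
    P, Q: (N, 3) arrays of corresponding 3D points from consecutive frames."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)                 # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # correct for a possible reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cQ - R @ cP
    return R, t

def ransac_rigid_transform(P, Q, n_iters=200, inlier_thresh=0.05, seed=0):
    """RANSAC over 3-point minimal samples, then a least-squares refit on the
    inlier set. inlier_thresh is a hypothetical residual threshold (metres)."""
    rng = np.random.default_rng(seed)
    best_inliers = None
    for _ in range(n_iters):
        idx = rng.choice(len(P), size=3, replace=False)
        R, t = fit_rigid_transform(P[idx], Q[idx])
        residuals = np.linalg.norm((P @ R.T + t) - Q, axis=1)
        inliers = residuals < inlier_thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit using all inliers of the best hypothesis
    return fit_rigid_transform(P[best_inliers], Q[best_inliers])
```

Under these assumptions, applying the fit to a single UAV's point class between frames captured a time step apart yields that UAV's per-frame motion, so dividing the recovered translation by the frame interval gives a first-order estimate of its relative velocity, consistent with the pipeline the abstract outlines.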