In echocardiography (echo), an electrocardiogram (ECG) is conventionally used to temporally align different cardiac views for assessing critical measurements. However, in emergencies or point-of-care situations, acquiring an ECG is often not an option, which motivates the need for alternative temporal synchronization methods. Here, we propose Echo-SyncNet, a self-supervised learning framework to synchronize various cross-sectional 2D echo series without any human supervision or external inputs. The proposed framework takes advantage of two types of supervisory signals derived from the input data: spatiotemporal patterns found between the frames of a single cine (intra-view self-supervision) and interdependencies between multiple cines (inter-view self-supervision). The combined supervisory signals are used to learn a feature-rich, low-dimensional embedding space in which multiple echo cines can be temporally synchronized. Two intra-view self-supervisions are used: the first is based on the information encoded by the temporal ordering of a cine (temporal intra-view), and the second on the spatial similarities between nearby frames (spatial intra-view). The inter-view self-supervision is used to promote the learning of similar embeddings for frames captured from the same cardiac phase in different echo views. We evaluate the framework with multiple experiments: 1) Using data from 998 patients, Echo-SyncNet shows promising results for synchronizing Apical 2-chamber and Apical 4-chamber cardiac views, which are acquired spatially perpendicular to each other; 2) Using data from 3070 patients, our experiments reveal that the learned representations of Echo-SyncNet outperform a supervised deep learning method that is optimized for automatic detection of fine-grained cardiac cycle phase; 3) We go one step further and show the usefulness of the learned representations in a one-shot learning scenario of cardiac key-frame detection.
Without any fine-tuning, key frames in 1188 validation patient studies are identified by synchronizing them with only one labeled reference cine. We make no prior assumptions about which specific cardiac views are used for training, and we show that Echo-SyncNet can accurately generalize to views not present in its training set. Project repository: github.com/fatemehtd/Echo-SyncNet.
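
To make the one-shot key-frame detection idea concrete, the following is a minimal sketch, not the Echo-SyncNet implementation. It assumes that each frame of a cine has already been mapped to an embedding vector, and it transfers key-frame labels from a single labeled reference cine to an unlabeled cine by nearest-neighbor matching in the embedding space. The function names (`nearest_frame`, `transfer_key_frames`, `toy_embeddings`), the Euclidean distance, and the synthetic circular embeddings that stand in for a trained encoder are all illustrative assumptions.

```python
import numpy as np


def nearest_frame(query_emb, target_embs):
    """Return the index of the target frame closest to the query embedding.

    Euclidean distance is an assumption made for this sketch.
    """
    dists = np.linalg.norm(target_embs - query_emb, axis=1)
    return int(np.argmin(dists))


def transfer_key_frames(ref_embs, ref_key_frames, target_embs):
    """Map labeled key-frame indices of a reference cine onto a target cine.

    ref_key_frames maps a label (e.g. "end_diastole") to a frame index in
    the reference cine; the result maps each label to a target frame index.
    """
    return {
        label: nearest_frame(ref_embs[idx], target_embs)
        for label, idx in ref_key_frames.items()
    }


def toy_embeddings(num_frames, phase_offset):
    """Hypothetical stand-in for a trained encoder.

    Each frame is placed on the unit circle according to its cardiac phase,
    so frames at the same phase in different cines share an embedding.
    """
    phase = 2.0 * np.pi * np.arange(num_frames) / num_frames + phase_offset
    return np.stack([np.cos(phase), np.sin(phase)], axis=1)


# Reference cine: one cardiac cycle over 40 frames, with labeled key frames.
ref_embs = toy_embeddings(40, phase_offset=0.0)
ref_key_frames = {"end_diastole": 0, "end_systole": 15}

# Target cine: same cycle, but acquisition started 5 frames earlier in phase.
target_embs = toy_embeddings(40, phase_offset=-2.0 * np.pi * 5 / 40)

target_key_frames = transfer_key_frames(ref_embs, ref_key_frames, target_embs)
print(target_key_frames)
```

In this toy setting, the target cine lags the reference by five frames, so each transferred key frame lands five frames later than in the reference. With real data, the embeddings would come from the trained network, and the matching could be replaced by any alignment scheme that operates on the learned embedding space.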