The quality of Extended Reality (XR) applications is vital, particularly the rendering quality of the XR Graphical User Interface (GUI). Unlike traditional two-dimensional (2D) applications, XR applications create a 3D digital scene by rendering two distinct 2D images, one for the user’s left eye and one for the right eye. Stereoscopic visual inconsistency (denoted as “SVI”) issues, however, disrupt how the user’s brain fuses the two images into a single 3D scene, leading to user discomfort and even adverse health effects. Such issues are common in XR applications but remain under-explored. To comprehensively understand SVI issues, we conduct an empirical analysis of 282 SVI bug reports collected from 15 XR platforms and summarize 15 types of manifestations. The empirical analysis reveals that automatically detecting SVI issues is challenging, mainly because: (1) training data is lacking; (2) the manifestations of SVI issues are diverse, complicated, and often application-specific; and (3) most accessible XR applications are closed-source commercial software, so we have no access to code, scene configurations, etc. for issue detection. Our findings imply that existing pattern-based supervised classification approaches may be inapplicable or ineffective in detecting SVI issues. To counter these challenges, we propose an unsupervised black-box testing framework named StereoID that identifies stereoscopic visual inconsistencies based only on the rendered GUI states. StereoID generates a synthetic right-eye image from the actual left-eye image and computes distances between the synthetic right-eye image and the actual right-eye image to detect SVI issues. To power the image generation process, we propose a depth-aware left-right-eye image translator that captures the expected perspective shifts between left-eye and right-eye images.
We build a large-scale unlabeled XR stereo screenshot dataset with more than 170K images from real-world XR applications, which can be used to train our depth-aware left-right-eye image translator and to evaluate the whole StereoID testing framework. In extensive experiments, the depth-aware left-right-eye image translator demonstrates superior performance in generating stereo images, outperforming traditional architectures. It achieves the lowest average L1 and L2 losses and the highest SSIM score, indicating strong pixel-level accuracy and structural consistency for XR applications. StereoID further proves effective at detecting SVI issues in both the user-reported dataset and in-the-wild XR applications. In summary, this novel framework enables effective detection of elusive SVI issues, benefiting the quality of XR applications.
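To make the comparison step concrete, the following is a minimal sketch of the idea of scoring a stereo pair by the distance between a synthetic and an actual right-eye image. It is not the StereoID implementation: `placeholder_translator` is a hypothetical stand-in that merely shifts the left-eye image by a fixed `disparity`, whereas the framework uses a learned depth-aware translator, and the function names and the `threshold` value are assumptions chosen for illustration only.

```python
import numpy as np


def l1_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Mean absolute per-pixel difference between two images."""
    return float(np.mean(np.abs(a.astype(np.float64) - b.astype(np.float64))))


def l2_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Mean squared per-pixel difference between two images."""
    return float(np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2))


def placeholder_translator(left: np.ndarray, disparity: int = 4) -> np.ndarray:
    """Hypothetical stand-in for a learned left-to-right translator.

    It shifts the whole left-eye image horizontally by a fixed number of
    pixels (wrapping at the border). A real translator would predict a
    depth-dependent shift per pixel.
    """
    return np.roll(left, -disparity, axis=1)


def detect_svi(left, right, translator=placeholder_translator, threshold=0.05):
    """Flag a stereo pair as suspicious when the synthetic and actual
    right-eye images differ by more than an assumed threshold.

    Returns (flagged, score), where score is the L1 distance.
    """
    synthetic_right = translator(left)
    score = l1_distance(synthetic_right, right)
    return score > threshold, score
```

For example, with images normalized to [0, 1], `detect_svi(left, right)` returns a `(flagged, score)` pair; in practice the threshold would have to be calibrated on stereo pairs known to be consistent, and a structural measure such as SSIM could be used alongside the pixel-wise distances.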