Proc Natl Acad Sci U S A. 2013 Apr 9;110(15):5802-5.
doi: 10.1073/pnas.1218772110. Epub 2013 Mar 11.

Private traits and attributes are predictable from digital records of human behavior

Free PMC article

Michal Kosinski et al. Proc Natl Acad Sci U S A. 2013.

Abstract

We show that easily accessible digital records of behavior, Facebook Likes, can be used to automatically and accurately predict a range of highly sensitive personal attributes including: sexual orientation, ethnicity, religious and political views, personality traits, intelligence, happiness, use of addictive substances, parental separation, age, and gender. The analysis presented is based on a dataset of over 58,000 volunteers who provided their Facebook Likes, detailed demographic profiles, and the results of several psychometric tests. The proposed model uses dimensionality reduction for preprocessing the Likes data, which are then entered into logistic/linear regression to predict individual psychodemographic profiles from Likes. The model correctly discriminates between homosexual and heterosexual men in 88% of cases, African Americans and Caucasian Americans in 95% of cases, and between Democrat and Republican in 85% of cases. For the personality trait "Openness," prediction accuracy is close to the test-retest accuracy of a standard personality test. We give examples of associations between attributes and Likes and discuss implications for online personalization and privacy.

Conflict of interest statement

D.S. received revenue as owner of the myPersonality Facebook application.

Figures

Fig. 1.
The study is based on a sample of 58,466 volunteers from the United States, obtained through the myPersonality Facebook application (www.mypersonality.org/wiki), which included their Facebook profile information, a list of their Likes (n = 170 Likes per person on average), psychometric test scores, and survey information. Users and their Likes were represented as a sparse user–Like matrix, the entries of which were set to 1 if there existed an association between a user and a Like and 0 otherwise. The dimensionality of the user–Like matrix was reduced using singular-value decomposition (SVD) (24). Numeric variables such as age or intelligence were predicted using a linear regression model, whereas dichotomous variables such as gender or sexual orientation were predicted using logistic regression. In both cases, we applied 10-fold cross-validation and used the k = 100 top SVD components. For sexual orientation, parents’ relationship status, and drug consumption only k = 30 top SVD components were used because of the smaller number of users for which this information was available.
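The pipeline described in this caption can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes NumPy is available, uses a small dense matrix where the study used a sparse one, keeps k = 5 components rather than 100, covers only the linear-regression case for a numeric attribute, and generates synthetic users and Likes. The function names (`build_user_like_matrix`, `svd_components`, `cross_validated_predictions`) are hypothetical.

```python
import numpy as np

def build_user_like_matrix(user_likes, n_likes):
    """Binary user-Like matrix: entry (u, l) is 1 if user u has Like l, else 0."""
    X = np.zeros((len(user_likes), n_likes))
    for u, likes in enumerate(user_likes):
        X[u, likes] = 1.0
    return X

def svd_components(X, k):
    """Represent each user by scores on the top-k singular directions of X."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]

def cross_validated_predictions(Z, y, n_folds=10, seed=0):
    """Out-of-fold linear-regression predictions from n_folds-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    A = np.column_stack([np.ones(len(y)), Z])  # intercept column plus SVD scores
    preds = np.empty(len(y))
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        coef, *_ = np.linalg.lstsq(A[train_mask], y[train_mask], rcond=None)
        preds[test_idx] = A[test_idx] @ coef
    return preds

# Synthetic toy data (NOT the study's data): 200 users, 50 possible Likes.
rng = np.random.default_rng(42)
n_users, n_likes, k = 200, 50, 5
trait = rng.normal(size=n_users)          # hypothetical numeric attribute
like_weights = rng.normal(size=n_likes)   # how strongly each Like tracks the trait
prob = 1.0 / (1.0 + np.exp(-np.outer(trait, like_weights)))
user_likes = [np.flatnonzero(rng.random(n_likes) < p).tolist() for p in prob]

X = build_user_like_matrix(user_likes, n_likes)
Z = svd_components(X, k)
preds = cross_validated_predictions(Z, trait)
r = np.corrcoef(preds, trait)[0, 1]       # Pearson r between predicted and actual values
print(f"cross-validated Pearson r on toy data: {r:.2f}")
```

For a dichotomous attribute, the least-squares step would be replaced by a logistic regression fitted on the same SVD scores, with the out-of-fold predictions scored by AUC instead of Pearson's r.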
Fig. 2.
Prediction accuracy of classification for dichotomous/dichotomized attributes expressed by the AUC.
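As a reminder of what the AUC measures, the sketch below computes it directly from its pairwise definition: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name `auc` and the example labels and scores are illustrative assumptions, not values from the study.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count a win when the positive outranks the negative, half a win for a tie.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and scores for illustration.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 3 of 4 pairs ordered correctly -> 0.75
print(auc([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5]))   # uninformative scores -> 0.5
```

Under this reading, an AUC of 0.88 means the model ranks a randomly chosen member of one class above a randomly chosen member of the other in 88% of pairs. The quadratic loop is fine for small examples; a rank-based formulation would be preferable for large samples.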
Fig. 3.
Prediction accuracy of regression for numeric attributes and traits expressed by the Pearson correlation coefficient between predicted and actual attribute values; all correlations are significant at the P < 0.001 level. The transparent bars indicate the questionnaire’s baseline accuracy, expressed in terms of test–retest reliability.
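For reference, the Pearson correlation coefficient used here as the accuracy measure can be computed as in the sketch below. The function name `pearson_r` and the example values are illustrative assumptions and do not come from the study.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("inputs must have the same length, at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Made-up predicted and actual values for illustration.
predicted = [1, 3, 2, 4]
actual = [1, 2, 3, 4]
print(pearson_r(predicted, actual))
```

Because the coefficient is invariant to shifts and positive rescaling of either variable, it rewards predictions that order and space individuals correctly even if they are off by a constant or a scale factor.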
Fig. 4.
Accuracy of selected predictions as a function of the number of available Likes. Accuracy is expressed as AUC (gender) and Pearson’s correlation coefficient (age and Openness). About 50% of users in this sample had at least 100 Likes and about 20% had at least 250 Likes. Note that for gender (a dichotomous variable), the random-guessing baseline corresponds to an AUC of 0.50.


Cited by 103 articles
