RedDust: a Large Reusable Dataset of Reddit User Traits

LREC 2020 · Anna Tigunova, Paramita Mirza, Andrew Yates, Gerhard Weikum

Social media is a rich source of assertions about personal traits, such as "I am a doctor" or "my hobby is playing tennis". Precisely identifying explicit assertions is difficult, though, because users' vocabulary and phrasing vary widely. Identifying personal traits from implicit assertions like "I've been at work treating patients all day" is even more challenging. This paper presents RedDust, a large-scale annotated resource for user profiling that covers over 300k Reddit users across five attributes: profession, hobby, family status, age, and gender. We construct RedDust using a diverse set of high-precision patterns and demonstrate its use as a resource for developing learning models that deal with implicit assertions. RedDust consists of users' personal traits, represented as (attribute, value) pairs, along with users' post ids, which may be used to retrieve the posts from a publicly available crawl or from the Reddit API. We discuss the construction of the resource and report statistics and insights into the data. We also compare different classifiers that can be learned from RedDust. To the best of our knowledge, RedDust is the first large-scale annotated language resource about Reddit users. We envision further use cases of RedDust in providing background knowledge about user traits to enhance personalized search and recommendation as well as conversational agents.
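To make the idea of pattern-based labeling concrete, the following is a minimal sketch of how explicit assertions might be turned into (attribute, value) pairs tied to a post id. The two regular expressions, the `Trait` record, and the `extract_traits` function are hypothetical names chosen for illustration; they are not the pattern set or the data format used to build RedDust.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns for illustration only; NOT the RedDust pattern set.
PATTERNS = {
    "profession": re.compile(
        r"\bI am an? (doctor|nurse|teacher|engineer)\b", re.IGNORECASE
    ),
    "hobby": re.compile(r"\bmy hobby is (?:playing )?(\w+)", re.IGNORECASE),
}


@dataclass(frozen=True)
class Trait:
    """One (attribute, value) pair plus the id of the post it came from."""
    attribute: str
    value: str
    post_id: str


def extract_traits(post_id: str, text: str) -> list[Trait]:
    """Return every trait that an explicit pattern matches in one post."""
    traits = []
    for attribute, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            traits.append(Trait(attribute, match.group(1).lower(), post_id))
    return traits


explicit = extract_traits("p1", "I am a doctor and my hobby is playing tennis.")
implicit = extract_traits("p2", "I've been at work treating patients all day")
print(explicit)  # one profession trait and one hobby trait, both from post p1
print(implicit)  # empty: implicit assertions are not caught by such patterns
```

The second call returns nothing, which illustrates why pattern-labeled data is paired with learned classifiers: narrow patterns favor precision, while implicit assertions need a model trained on the labeled users.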
