At Landing.jobs, we partner with the best. We believe that XING shares some of our core values when it comes to helping professionals grow their careers, so we offered XING a voice on our platform.
XING is a professional social network that is popular in German-speaking Europe. XING’s mission is to enable professionals to grow. Concretely, this mission translates into a range of services: members can describe their professional experience, search for and keep in contact with current and former colleagues, search and browse job ads to plan their next career moves, and get in contact with active recruiting professionals. In this blog post, we give an overview of how Data Scientists at XING leverage natural language processing to improve services like job recommendations and search.
XING members can create professional profiles which include a history of the working positions they have held and further information (e.g. skills, tools, and products they master) that might be of interest to potential future employers, clients, or other contacts. This information is a great basis for XING to offer its users recommendations of suitable job ads. However, real-world recommender systems are complex and present many interesting challenges. One of them, especially in the case of job recommendations at XING, is that important features are expressed in natural language: no restriction is imposed on how users describe their skills or the titles of their working positions, so they can enter whatever sequence of characters they judge appropriate.
The first of these challenges is that descriptions can be in multiple languages: German and English are both used extensively on XING. A single profession, like “Software developer”, can be described by a different character sequence in another language, for example “Software Entwickler” in German. Similarly, even within one language, different labels can refer to the same concept: “Computer programmer” is often used to refer to the same profession as “Software developer”. Additionally, natural language texts are often ambiguous. The word “Architect”, for example, could refer to an architect who designs buildings or to a software architect. This is rarely a problem for humans, who can easily infer the intended meaning from other information in the user profile, such as the user’s educational background or skills; handling this disambiguation programmatically is much harder. And finally, free-form text input often contains typos and misspellings, which further increases the difficulty of processing this information computationally.
Analyzing natural language and inferring usable features for a job recommendation engine is thus challenging: it is difficult to derive content-based similarities between users and job postings, as well as among users and among job postings, yet deriving such similarities is a requirement of many recommender systems. Our Data Science team applies a three-fold strategy to address this challenge: 1) autocomplete on user input, to reduce misspellings and lexical diversity at the source; 2) a taxonomy combined with string matching techniques, to deal with the lexical diversity that remains; and 3) a neural-network-based approach, Word2Vec, that can handle concepts and labels not captured in the taxonomy.
Based on Wikipedia, we built a taxonomy of so-called entities, each defined by a set of labels that refer to the same concept. For example, one of those entities groups the labels “Software developer”, “Software programmer”, “Software Entwickler” (the German denomination), and other labels referring to the same concept. These entities are then searched for in job ads and in user data. We use two methods for searching entity references: 1) exact string search with an Aho-Corasick trie; and 2) fuzzy matching with a character n-gram Lucene index.
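As a minimal illustration of the data model (a Python sketch rather than our actual Scala code, with invented entity IDs and a tiny label set), an entity is just an ID plus the set of labels denoting the same concept, inverted into a label-to-ID dictionary for matching:

```python
# Hypothetical fragment of the taxonomy: each entity ID maps to the
# set of labels that refer to the same concept.
taxonomy = {
    1001: {"software developer", "software programmer", "software entwickler"},
    1002: {"data scientist", "datenwissenschaftler"},
}

# Inverted into a label -> entity-ID dictionary, the form consumed by
# the string-matching methods.
label_to_entity = {
    label: entity_id
    for entity_id, labels in taxonomy.items()
    for label in labels
}
```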
Aho-Corasick is a trie-based dictionary search algorithm that finds dictionary entries in time linear in the length of the searched text, independently of the number of dictionary entries. The paths in the trie are the entity labels of our taxonomy, normalized beforehand. Label normalization aims at increasing the number of detected entities and is currently limited to Unicode NFKC normalization, lower-casing, and whitespace uniformization. The values stored in the nodes of the trie are the entity IDs, so that we can quickly find entity references in the text of a user profile or a job posting. Further normalization techniques from Information Retrieval, such as stemming or lemmatization, could be considered in the future.
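A compact sketch of this pipeline (Python for brevity, whereas our production code is Scala; the labels and entity IDs are made up): normalize each label as described above, build a trie with Aho-Corasick failure links, and then scan a text in a single pass, emitting the ID of every entity label it contains.

```python
import unicodedata
from collections import deque

def normalize(text):
    # Unicode NFKC normalization, lower-casing, whitespace uniformization
    return " ".join(unicodedata.normalize("NFKC", text).lower().split())

class AhoCorasick:
    """Multi-pattern search: finds all dictionary labels in a text in time
    linear in the text length, for a fixed dictionary."""

    def __init__(self, label_to_entity):
        self.goto = [{}]   # trie transitions per node
        self.fail = [0]    # failure links
        self.out = [[]]    # entity IDs whose label ends at this node
        for label, entity_id in label_to_entity.items():
            node = 0
            for ch in normalize(label):
                if ch not in self.goto[node]:
                    self.goto.append({}); self.fail.append(0); self.out.append([])
                    self.goto[node][ch] = len(self.goto) - 1
                node = self.goto[node][ch]
            self.out[node].append(entity_id)
        # Breadth-first construction of the failure links
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                queue.append(child)
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                self.out[child] += self.out[self.fail[child]]

    def find_entities(self, text):
        node, hits = 0, []
        for ch in normalize(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            hits.extend(self.out[node])
        return hits
```

For example, `AhoCorasick({"software entwickler": 1001, "java": 2000}).find_entities("Senior Software Entwickler (Java)")` returns both entity IDs, despite the differences in casing.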
The Aho-Corasick algorithm described above does not handle typos or misspellings in user input. We therefore additionally use a string similarity metric comparable to the Levenshtein edit distance to find the entity label most similar to the user input. Since our taxonomy does not contain every possible label and concept, we use a heuristically chosen threshold on this metric to decide automatically whether the concept referenced in the user input exists in our taxonomy. However, computing the Levenshtein distance between a user input and every entity label stored in the taxonomy would be too time-consuming. To speed this up, we first retrieve a set of candidate entity labels using a character n-gram Lucene index. This fuzzy matching trades off some precision in favor of recall, which is to some extent desirable for job ad recommendations.
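The idea can be sketched as follows (Python for illustration, with a plain in-memory trigram index standing in for Lucene; the distance threshold here is an invented value, not our production setting): retrieve the labels sharing the most character n-grams with the input, then accept the closest one only if its edit distance stays below the threshold.

```python
from collections import defaultdict

def char_ngrams(s, n=3):
    s = f"#{s}#"  # pad so that word boundaries contribute n-grams too
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}

class FuzzyEntityMatcher:
    def __init__(self, labels, n=3):
        self.n = n
        self.index = defaultdict(set)  # n-gram -> labels containing it
        for label in labels:
            for g in char_ngrams(label, n):
                self.index[g].add(label)

    def candidates(self, query, top_k=10):
        # Labels sharing the most character n-grams with the query, best first
        overlap = defaultdict(int)
        for g in char_ngrams(query, self.n):
            for label in self.index.get(g, ()):
                overlap[label] += 1
        return sorted(overlap, key=overlap.get, reverse=True)[:top_k]

def levenshtein(a, b):
    # Classic dynamic-programming edit distance, row by row
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def best_match(query, matcher, max_distance=2):
    # Heuristic threshold: at or below it we assume the input refers to a
    # known taxonomy label; above it we treat the concept as unknown.
    scored = [(levenshtein(query, c), c) for c in matcher.candidates(query)]
    if scored:
        distance, label = min(scored)
        if distance <= max_distance:
            return label
    return None
```

Computing the exact distance only against the handful of n-gram candidates, instead of the whole taxonomy, is what makes the lookup fast enough for online use.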
To give a technology overview: bootstrapping entity matching across the millions of user profiles and job postings is done with an Oozie workflow that mostly runs Hive jobs on a Hadoop cluster. For updating entity matches online, which is needed when user profiles are updated or new job postings are published, we use the same procedure as described above, implemented as a Kafka Streams application. Most of our implementations are in Scala.
Current developments include representing user inputs in a Euclidean space using Word2Vec, a neural-network-based technique that is more difficult to use for recommendations than entity IDs, but that has the advantage of not requiring any taxonomy while still coping with lexical diversity to some extent. Not relying on a taxonomy also means that it covers concepts that are not in our taxonomy. And finally, to deal with the ambiguity of natural language, we are investigating a graph random walk with restart technique inspired by Moro et al., “Entity Linking meets Word Sense Disambiguation: a Unified Approach”, TACL 2014.
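As a toy illustration of why such a vector space helps (the three-dimensional vectors below are hand-made stand-ins; real Word2Vec embeddings have hundreds of dimensions and are learned from profile and job-ad text): lexical variants of the same profession end up close together, so cosine similarity relates them without any taxonomy entry.

```python
import math

# Hand-made stand-ins for learned Word2Vec vectors (purely illustrative)
embeddings = {
    "software developer":  [0.90, 0.12, 0.03],
    "software entwickler": [0.88, 0.15, 0.05],
    "gardener":            [0.05, 0.10, 0.90],
}

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: 1.0 means same direction
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm
```

With trained embeddings, “software developer” would score much closer to “software entwickler” than to an unrelated profession, even though the two strings share no dictionary entry.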