Old Age
IPOD: An Industrial and Professional Occupations Dataset and its Applications to Occupational Data Mining and Analysis
October 22, 2019 · Entered Twilight
"Last commit was 5.0 years ago (โฅ5 year threshold)"
Evidence collected by the PWNC Scanner
Repo contents: README.md, data, license.md
Authors
Junhua Liu, Yung Chuen Ng, Kristin L. Wood, Kwan Hui Lim
arXiv ID
1910.10495
Category
cs.CL: Computation & Language
Cross-listed
cs.IR, cs.LG
Citations
6
Repository
https://github.com/junhua/ipod
⭐ 70
Last Checked
1 month ago
Abstract
Occupational data mining and analysis is an important task in understanding today's industry and job market. Various machine learning techniques have been proposed and gradually deployed to improve companies' operations for downstream tasks such as employee churn prediction, career trajectory modelling and automated interviewing. Job title analysis and embedding, as the fundamental building blocks, are crucial upstream tasks for addressing these occupational data mining and analysis problems. In this work, we present the Industrial and Professional Occupations Dataset (IPOD), which consists of over 190,000 job titles crawled from over 56,000 LinkedIn profiles. We also illustrate the usefulness of IPOD by addressing two challenging upstream tasks: (i) Title2vec, a contextual job title vector representation using a bidirectional Language Model (biLM) approach; and (ii) occupational Named Entity Recognition (NER) using Conditional Random Fields (CRF) and a bidirectional Long Short-Term Memory network with a CRF layer (LSTM-CRF). Both CRF and LSTM-CRF outperform both human performance and the baselines in exact-match accuracy and F1 score. The dataset and pre-trained embeddings are available at https://www.github.com/junhua/ipod.
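The abstract names two concrete techniques, so two small sketches may help make them concrete. First, a minimal CRF sequence tagger in the spirit of the paper's CRF baseline for occupational NER, written with the sklearn-crfsuite library. The toy titles, the tag scheme (RES/FUN/O) and the feature template below are illustrative assumptions, not the authors' actual labels or setup.

```python
# Minimal CRF sequence tagger for job-title NER, sketching the kind of CRF
# baseline the abstract describes. Tags and features are assumptions.
# Requires: pip install sklearn-crfsuite
import sklearn_crfsuite

def token_features(tokens, i):
    """Hand-crafted features for the token at position i (illustrative)."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_title": tok.istitle(),
        "suffix3": tok[-3:],
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

# Toy training set: tokenised job titles with a hypothetical tag scheme
# (RES = seniority/rank, FUN = function; not the paper's exact label set).
titles = [["Senior", "Software", "Engineer"],
          ["Vice", "President", "of", "Marketing"]]
labels = [["RES", "FUN", "FUN"],
          ["RES", "RES", "O", "FUN"]]

X = [[token_features(t, i) for i in range(len(t))] for t in titles]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)
print(crf.predict(X))  # per-token tags for each title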
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
Similar Papers
In the same crypt · Computation & Language
Old Age
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
R.I.P.
👻
Ghosted
Language Models are Few-Shot Learners
R.I.P.
👻
Ghosted
RoBERTa: A Robustly Optimized BERT Pretraining Approach
R.I.P.
👻
Ghosted
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
R.I.P.
👻
Ghosted