CAREER is a Transformer Model Used to Predict Occupational Outcomes
or how I noticed this paper when I saw Susan Athey's name attached to it at 3am this morning
I’m in Memphis this weekend for a funeral. My uncle died two weeks ago, and so his family and extended family (my family) have been traveling here. Apparently, under Covid, the time from death to funeral was extended, and that may be a more permanent thing, as my aunt was told that she had time for planning and that there was no rush. So we are here. After I hit send on this, though, I am driving to Nashville to see a close relative who has moved there, and we are going to go see the new Guy Ritchie movie. I woke up about an hour ago at 3:30am CST with vivid but forgotten (and thankfully not stressful) dreams, thinking about the future, my career, and frankly, the meaning of life, not to be melodramatic.
But while I was lying in bed, somehow I found this interesting new paper and couldn’t stop reading it, and so instead of doing my weekly Saturday morning collection of links, I decided to share bits of it with you. It’s a new predictive model coauthored by six experts in computer science and machine learning, including Susan Athey, that they’re using with a gigantic corpus of online resumes to predict the career paths of individuals, and it’s the first time I’ve seen a transformer model used in economics, so I wanted to share a little about it.
CAREER model
“CAREER: A Foundation Model for Labor Sequence Data” is a working paper that was first posted two years ago to arXiv and was updated at the end of February 2024. Its coauthors are Keyon Vafa, Emil Palikot, Tianyu Du, Ayush Kanodia, Susan Athey, and David M. Blei. It was Susan’s name, though, that caught my eye and caused me to dig more closely into the paper. The idea of the paper is laid out in this one question:
“Given an individual’s career history, what is the probability distribution of their occupation in the next timestep?”
So the paper builds a predictive model that conditions on an individual’s entire career history and then predicts, out of sample, that person’s occupational outcomes. That part alone didn’t catch my eye, though for many labor economists it’s exactly the part that would. What caught my eye is that the authors use a transformer model. Before I explain what a transformer model is, listen to them describe their predictive model of occupation:
We develop a representation-learning method—a transformer adapted for modeling jobs—for building such predictive models of occupation. Our model is first fit to large-scale, passively-collected resume data and then fine-tuned to more curated economics datasets, which are carefully collected for unbiased generalization to the larger population. The representation it learns is effective both for predicting job trajectories and for conditioning in downstream economic analyses.
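To make that setup concrete, here is a minimal sketch, written by me rather than taken from the paper, of what “predict the distribution over the next occupation given the career history” can look like in code: occupations are treated like tokens, a small transformer reads the history, and the output is a probability distribution over the next occupation. Everything below (the vocabulary size `NUM_OCCUPATIONS`, the dimensions, the layer counts) is a toy assumption for illustration, not the CAREER architecture itself.

```python
# Illustrative sketch only -- NOT the CAREER model. All sizes are made-up toy values.
import torch
import torch.nn as nn

NUM_OCCUPATIONS = 500   # hypothetical size of the occupation vocabulary
EMBED_DIM = 64
MAX_HISTORY_LEN = 40    # max number of job-years in a career history

class NextOccupationModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.occ_embedding = nn.Embedding(NUM_OCCUPATIONS, EMBED_DIM)
        self.pos_embedding = nn.Embedding(MAX_HISTORY_LEN, EMBED_DIM)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=EMBED_DIM, nhead=4, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.head = nn.Linear(EMBED_DIM, NUM_OCCUPATIONS)

    def forward(self, history):
        # history: (batch, seq_len) of integer occupation codes
        positions = torch.arange(history.size(1), device=history.device)
        x = self.occ_embedding(history) + self.pos_embedding(positions)
        x = self.encoder(x)              # contextualize the whole career history
        return self.head(x[:, -1, :])    # logits for the *next* occupation

model = NextOccupationModel()
career = torch.tensor([[12, 12, 87, 87, 87, 230]])   # one toy career history
probs = torch.softmax(model(career), dim=-1)          # distribution over next job
```

In the authors’ two-stage recipe, a model along these lines would first be fit to the giant resume corpus and then fine-tuned on the smaller, carefully collected survey datasets.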
Transformer Models and the Attention Mechanism
Transformer models underpin modern large language models like GPT-4. In explaining their model, the authors cite the famous Vaswani et al. (2017) paper, “Attention Is All You Need,” written by the then-Google team. If you are unfamiliar with this paper, check out this video explaining visually how transformers work:
The 2017 “Attention is All You Need”1 paper was what led the nascent startup and nonprofit OpenAI to begin experimenting with the architecture and, upon discovering the increasing returns to scale, to buy up NVIDIA GPUs, partner with Microsoft, and make its own breakthroughs in the development of ChatGPT, which is based on the transformer architecture.
That Vaswani et al. paper introduced the transformer model, which is the basis of CAREER. The transformer is built around an innovative attention mechanism that allows the model to focus on different parts of a sequence and better capture contextual relationships, which turns out to make it highly efficient for processing sequential data like text or time series.
Transformers are not the first models built for sequential data, though. One traditional model is the Recurrent Neural Network (RNN), which processes sequences by looping through the input data, allowing each step’s output to depend on previous steps. RNNs are adept at handling tasks where sequence order is important, such as language modeling, because they can carry information across the data sequence. However, they often struggle with learning long-range dependencies, which led to the development of variants (such as LSTMs) to address these challenges.
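To see what that looping means in practice, here is a bare-bones sketch of the RNN recurrence in plain NumPy; the dimensions and random weights are toy values of my own choosing, not anything from the paper. The key point is that each hidden state depends on the previous one, so the sequence has to be walked through in order.

```python
# Bare-bones RNN recurrence (toy dimensions, for illustration only):
# the hidden state at each step depends on the previous step, so the
# sequence must be processed one element at a time.
import numpy as np

rng = np.random.default_rng(0)
INPUT_DIM, HIDDEN_DIM, SEQ_LEN = 8, 16, 5

W_xh = rng.normal(size=(INPUT_DIM, HIDDEN_DIM)) * 0.1   # input-to-hidden weights
W_hh = rng.normal(size=(HIDDEN_DIM, HIDDEN_DIM)) * 0.1  # hidden-to-hidden weights

inputs = rng.normal(size=(SEQ_LEN, INPUT_DIM))   # a toy input sequence
h = np.zeros(HIDDEN_DIM)                         # initial hidden state

for x_t in inputs:                               # strictly sequential loop
    h = np.tanh(x_t @ W_xh + h @ W_hh)           # new state mixes input + old state
```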
So unlike a model that processes data one step at a time like the RNN, Vaswani et al. (2017)’s transformer architecture processes all elements of the input data simultaneously. It achieves this through layers of self-attention, a mechanism that allows the model to weigh the importance of different parts of the input relative to each other, improving its handling of the sequence. Those attention layers are interleaved with simple feed-forward networks, and because nothing requires stepping through the sequence element by element, the model is exceptionally efficient and scalable. This parallel processing capability allows it to handle both larger datasets and longer sequences more effectively.
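And here, for contrast, is an equally bare-bones sketch of scaled dot-product self-attention, again with toy shapes I have made up: every position is compared with every other position in one batch of matrix multiplications, with no sequential loop, which is exactly the parallelism described above.

```python
# Minimal scaled dot-product self-attention (toy shapes, illustration only):
# all positions attend to all others at once, with no sequential loop.
import numpy as np

rng = np.random.default_rng(0)
SEQ_LEN, MODEL_DIM = 5, 16

X = rng.normal(size=(SEQ_LEN, MODEL_DIM))        # one toy input sequence
W_q, W_k, W_v = (rng.normal(size=(MODEL_DIM, MODEL_DIM)) * 0.1 for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v              # queries, keys, values
scores = Q @ K.T / np.sqrt(MODEL_DIM)            # every position vs. every other
scores -= scores.max(axis=-1, keepdims=True)     # numerical stability for softmax
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row softmax
output = weights @ V                             # weighted mix of the whole sequence
```

In a real transformer this operation is repeated across multiple heads and layers, with feed-forward networks in between, but the core computation is just these few lines.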
The transformer represented a shift in how data sequences are handled, replacing the RNN’s recurrent-layer approach with a system that leverages self-attention to analyze entire sequences in one go. This approach has significantly improved performance on a variety of tasks, including language translation and economic forecasting, making it a valuable tool for data-intensive applications.
Contribution to Labor and Economics
But this was the first time I put two and two together and realized that authors and papers I had already seen were working on this, too. Sarah Bana at Chapman University, for instance, is someone whose work I have been following ever since the old #EconTwitter days, and she has a paper using large language models to understand wage premia. Here’s a video of her explaining the paper. Her models are BERT and GPT-3, so they are immediately accessible to labor economists (or anyone).