You may be interested in this paper in Political Analysis by some RAND researchers: https://www.cambridge.org/core/journals/political-analysis/article/stay-tuned-improving-sentiment-analysis-and-stance-detection-using-large-language-models/2D8F121012D3D1CB2259B6DD5EE32D0D
They find that combining in-target tuning with some additional prompting strategies improves classification by a good margin.
Thanks for sharing this so publicly, Scott. I'm really enjoying following along and learning how to use this approach in my own research as well.
What processes are you using to record the steps you've taken for quality assurance as a human? I imagine that would be really interesting, and in some instances necessary, to report if this were ever published as peer-reviewed research.
Hi Scott. Super interesting experiment and blog!
I'm currently engaged in a similar endeavor with forum posts, but I keep running into token restrictions when trying to use the cheaper batch processing, in my case on Google Cloud. OpenAI is supposedly even more restricted in this regard. To understand your workflow and data: when you say 300k documents, are these chunks of speeches? What is the overall token count of the speech data?
Thanks, and looking forward to reading your continued blog tomorrow.