About three and a half years ago, OpenAI released the first generally available chatbot, ChatGPT, into the wild. Since then, AI has taken over almost every aspect of our lives. While mostly to our detriment, there are some use-cases where generative AI can be genuinely helpful. The biggest hurdle for any productive adoption of generative AI has been – and will remain for quite some time – the lack of concrete descriptions for how to use generative AI.
Since the launch of OpenAI, I have been keeping close tabs on the evolution of open-weight models (i.e., LLMs that you can just download and run on your own computer). While these models performed subpar for a long time, I have found that in recent months their quality has made a considerable jump. Now I feel like I can finally start thinking of some genuinely useful tasks for which generative AI can be used.
In this article, I will first briefly recap the evolution of how generative AI has been used more generally, before introducing my current local setup. In the third section, I will then present two concrete use-cases for which I have found that generative AI is excellent and can help quite a lot in the mentally demanding field of social scientific research.
The Evolution of Uses for Generative AI
As one of the first use-cases I saw, the YouTube channel “Linus Tech Tips” released a video only weeks after ChatGPT’s release testing its ability to help a technically less versed user in building a working computer. This is one of the commonly accepted use-cases for such things: Instead of googling widely available and stable information from base-recipes for cakes or sauces to building a computer, it is oftentimes simpler to ask AI for that. But when it comes to more specialized type of cognitive work, the results are far and wide between.
For coding, it has turned out that one needs to enforce tight boundaries for these tools to perform even remotely decent work. These tools, most prominently “Claude Code” or Open Source alternatives from “pi” (not the raspberry) to “OpenCode,” have slowly advanced to enable more and more coding work in various settings. These tools have become known as the category of “harness,” which evokes a mental image of a horse that becomes much more useful for our purposes once we slap a saddle onto it.
Of course, generative AI is supposed to be the most helpful when it does what it has been trained on: generate natural language text. Here, there are broadly speaking two categories of tasks that users typically perform. Either they let it generate some text ex nihilo; that is: users ask a question such as “Please generate some ideas that I could turn into a paper” and hope that the model generates some useful ideas. Or they provide it with some text, say, a paper, and ask it to summarize it.
Over and over, I found that letting an AI model just generate text from a very short prompt is not really helpful. This works if the model shall reproduce some generic knowledge (think a base recipe for a sauce), but it becomes less useful as the specificity of the task increases.
Asking generative AI about the base recipe for sauce Hollandaise is excellent because this type of recipe will occur thousands of times in its training data. It will likely produce a correct recipe, but without telling you a story about its aunt’s version.
But if you ask it some extremely specific questions pertaining to your personal research niche, the model will inevitably start to hallucinate and generate wrong facts. Simply because that kind of information is not common in the training data, and as such the model cannot generate correct answers.
The easiest way to reduce the amount of false ideas the model generates is by providing it with a bunch of text and asking it to essentially use only that text to produce an answer, in a way of summarization or condensing. Fortunately, there are plenty of use-cases where you can provide some text, even if the task sounds more generative than summarizing. To understand this, let’s head to discussing real-world use-cases for gen AI in research.
Two Use-Cases for Generative AI
It is these cases that I wish to talk about today: providing a lot of context to a model and asking it to do something with that text, rather than coming up with completely new text.
The following use-cases have been tested using the current Gemma 4 E4B model locally on a MacBook Pro (M2 Pro, 16 GB RAM) with a context window of 128,000 tokens, and a temperature setting of 0.8. Since these models typically generate somewhat random texts and are sensitive to slight variations in the “prompts,” you will get different results with different models, providers, and quantization forms.
Use-Case 1: Preparing a Paper for Submission
The first use-case I have identified is to get some help in rewriting a paper for submission to a journal. Typically, the academic workflow from idea to publication goes something like this: We have an idea, we do the research necessary to test the idea, and then we write up the results. But then, we have to publish it somewhere. Journals have slightly different demands to the structure of a paper. Some put more emphasis on the theory, others want more methods, and others again want to hear a lot about the empirical case. Unless you are doing some decisively theoretical or methodological work, a first draft of a paper will probably have about equal parts theory, methods, and empirics. But if you want to submit that paper to a journal that emphasizes theoretical grounding or storytelling, then you will obviously need to rewrite it slightly to reflect this focus.
I recently had to do just that, and was blanking out when having to think about how I should be rewriting it. I knew that the journal puts a lot of emphasis on rigorous theoretical work and less on empirics. It also had a specific angle, a specific sociological perspective that its published papers should roughly align with. I knew that my paper contained everything they demanded, but not in the form they wanted.
But then I had an idea: Since I had already been citing a few papers from that journal with very similar empirical cases or methods, I realized I could try out an idea: What if I gave the model a few papers that had been published in that journal, and asked it to identify how my current state of the paper deviated from that structure?
The prompt that I came up with and that delivered surprisingly usable results is the following (note that I have redacted pieces of information that could taint the peer-review process):
Attached you can find three papers; one by [redacted], one by [redacted], and one by me. Mine is the [redacted] paper. I am planning to submit the paper to the journal [redacted]. The two papers by [redacted] and [redacted] have been successfully published in [redacted] in the past years. I want you to take a look at all three papers and identify potentially problematic sections in my paper that I should fix before submission to increase my chances of successful publication. I have selected the [redacted] and [redacted] papers because they focus on quite similar phenomena, and as such it should be easier to identify the important similarities and dissimilarities between the papers. Since those papers have been successfully published, they clearly fulfilled the journal's requirements, and I want to know how good (or bad!) my paper is doing in this regard.
To streamline the process, here are a few hints on how to approach this:
- Pay special attention to the overall framing of the papers. The journal is focused on [redacted sub-field], but my paper looks at a quite [redacted other sub-field] phenomenon. This isn't a problem per se (since [redacted] and [redacted] got published, too), but it requires paying attention to the framing, and the story the papers are telling.
- Also take the theoretical framing into account – how do they do it, and how should I be doing it?
- My paper has gotten some feedback from colleagues already, and there is some cautious recommendation that [journal redacted] might be a good venue.
- Lastly, do not pay too much attention to my discussion and conclusion sections, since those yet have to be rewritten to align with the previous sections (specifically the hypotheses I now included in my background section which aren't accounted for there). But feel free to provide some pointers if you can make them out in this regard.
Please respond first with a TL;DR section for quick summarization, and then a more detailed breakdown of your findings. Thanks!
This prompt, alongside a hefty payload of three papers of about 20 pages each, took a few minutes to process, but yielded great results.
Now, of course there was a ton of sycophancy in the results (“You are tackling a major, complex topic” – yeah, shut up), but I was impressed by the accuracy of the results. It identified precisely what some other colleagues have also identified and which is a common problem of mine – too much methods, too little theory – but it also found a few potential avenues to get started with that.
At this point, I would love to share the TL;DR, as it helped immensely to get started with the rewriting process, but in redacting potentially sensitive things I found that only basic grammatical structures were retained, so that won’t actually be useful.
This is already an interesting finding in itself: It turns out that the model precisely carved out things I did, and categorized them correctly (such as measurements, theoretical angles, or the core mechanism). Effectively, for the model, this was a common summarization task.
This made the response helpful in two ways: First, it formed a very dense summarization of my main argument, which allowed me to reflect on the “big picture” a bit better and get a feeling for which details of the paper are maybe necessary, but not part of the core argument. This is the classical problem of “killing one’s darlings” – without external feedback, every single detail seems load-bearing. But most really aren’t. Second, because it also had two successfully published papers in its context window, the model was able to compare the general structure and make some suggestions on where my paper deviated from these examples.
Based on this response, it was extremely straight-forward to sit down over the course of a week and rewrite the story and framing of the paper entirely to align better with the focus of the journal in question. If this was actually effective for reaching the goal? I will let you know if the paper survives the peer review!
Use-Case 2: LLMs as “Friendly Reviewers”
A second use-case which, again, falls into the category of summarization, is letting an LLM do a review of a paper. Once I was done with rewriting the paper, I was interested in getting an initial review of the paper. Now, an LLM is not a human reviewer and lacks the critical lens of a researcher. But it’s better than nothing.
The best reviewers you can get are certainly those from the journal itself. However, in order to get reviews, you must first pass the threshold of desk reject. If you receive a desk reject, you won’t receive reviews. Only if the editors believe your paper roughly fits the journal and looks correct (i.e., it passes the “sniff test”), they will send it out to reviewers.
This means that actual, reliable reviews from your peers are quite difficult to get. And here’s where an LLM can also provide at least some help. The biggest problem I found with LLM-based reviews is that it is very hard to get it to be a reviewer 2. It will always be a friendly reviewer. However, given enough context and enough instructions, you can make a model do a relatively okay job of pointing out issues.
Here’s the prompt that I used to instruct the model to generate a review of the rewritten paper to get an initial judgment of how it may fare with humans:
You are an expert reviewer for a sociological journal. You are an expert in [redacted subfield] and a renowned institution in the field. We have just received a submission for consideration for publication in the journal, which you can find attached to this message. Your task is to provide a review of this paper that the editors can use to determine whether the submission is suitable for publication in the journal.
Please provide a thorough review of the submission. Focus in particular on the theoretical rigor of the submission, and check the logical soundness of each argument the paper makes. Make a suggestion for acceptance, minor revisions, or major revisions, and provide a justification of at most 800 words (or roughly two pages).
Please suggest a decision based on the following heuristics:
- Accept: The paper has only minimal issues that can be fixed comparatively quickly by the authors and do not impede the integrity of the analytical argument.
- Minor revision: The paper suffers from some minor flaws that the authors must fix, but which do not fundamentally destabilize the main argument.
- Major revision: The paper's main argument suffers from some problems that might impact the relevancy of the results or the conclusions drawn. The paper is still a sufficiently complex piece of work, but requires intensive work from the authors before it can be considered for publication.
Two notes on this prompt: First, you may notice that I did not provide the model with the opportunity to “reject” because, again, that LLM is not my actual reviewer and I want to have some actionable items, so I wanted it to give me a “vibe check” based on the contents of the paper. And second, I tried to provide specific instructions biasing the model to be a bit more “mean” than it might otherwise be. By framing my paper as “someone else’s” paper and positioning myself as an editor who is concerned less with some author but rather with the journal quality, I wanted to ensure that the RLHF training did not nudge the model in being nice towards the paper. If I had asked the model to review “my” paper, this likely would have triggered a more sycophantic response.
I used the model’s response in two ways. On the one hand, I was able to confirm that my rewritings actually made a difference, since the core arguments as the model extracted them were now closer aligned with a theoretically heavy framing, but it still recognized the methodological advances. And second, I used the responses to identify some parts in my paper where my argumentation appears not to be entirely sufficient yet, and polished those positions some more.
One interesting anecdote I have from the model’s response is how the model itself saw its role:
Reviewer Identity: Expert in [redacted subfield], Tenured Professor at a leading Sociology Department.
I guess what our kind lacks in terms of confidence is fully balanced out by the bolstering confidence of LLMs.
Final Thoughts
As we go along in the development of GPT models, we will see better and better models. And that also includes locally running models, which I prefer since they won’t leak any proprietary or confidential data to AI corporations. These are just two examples for how to use generative AI in academic work, and there are plenty more.
I always feel that it’s hard to figure out how to use such models, because there are few practical examples on the internet. Many still treat “prompting” as “engineering” (which is an insult to engineering as a profession), instead of just basic techniques to improve one’s work.
Three final notes to these use-cases. First, you will notice that the “prompts” are quite verbose. That is by design, to steer the generated text. The more text you provide, the less “wiggling room” the model’s probability distribution will have to hallucinate weird facts.
Second, you may have realized that I only used the model to help in writing my own papers, which I know in and out. Letting them summarize someone else’s papers cannot replace you reading the paper, since the model will overlook certain details that are important to you, will remain generic, and it might hallucinate information despite the correct information being in the context. Using them for one’s own work is unproblematic, as it will be quite simple for you to recognize mistakes.
Third, I have only used the model to give me some pointers. I did not take any of its output as a given, and the way I framed these two tasks ensured that I was still the primary decision-maker. The paper is still fully handwritten and has no AI slop in it, because I strongly believe that I should be the one doing the work. Other than that, generative AI indeed is becoming more and more useful every day.