Four AI models walked into an interview...
We assessed how well they could fulfill a writing role at a bilingual-learning edtech startup. Here’s how they did.
One of the most common asks I get from both current and prospective clients is for referrals to ghostwriters. This is somewhat surprising: when ChatGPT catalyzed an explosion of AI language generators almost a year and a half ago, writing saw the steepest drop in demand of any freelance category on the online marketplace Upwork.
As a lifelong writer, I shudder at the notion that my one real talent can be so easily supplanted. But as someone who has spent fifteen years working in the social sector, I can also acknowledge that an AI model that writes well would be a boon to cash-strapped nonprofits. After all, the fewer resources and the less time an organization has to spend on writing fundraising emails, volunteer newsletters, and annual reports, the more it can dedicate to delivering mission-critical services, like providing nutritious food to hungry people, educating kids, or planting trees. So the big question is: can AI actually do the job well?
There’s a difference between asking whether AI can write well, generally, and whether it can do social sector writing well. If you’re interested in the first question, I recommend listening to Ezra Klein’s podcast, which dedicates the first episode of a series on AI to comparing the strengths and weaknesses of various models at various tasks, e.g. writing code or synthesizing information. (Claude 3 is deemed “the most literary,” i.e. the best at writing.) If you’re interested in the second question, read on!
I’ll admit that, in the social sector at least, AI isn’t facing stiff human competition. The sector is uniquely terrible at storytelling. This is partly because it suffers from excruciating industry jargon and grant requirements like logic models. It’s also because social sector writing is a particular beast of a craft, for the following reasons:
Unlike journalism or fiction or film, it isn’t directed at a general or a niche audience, but at an enormous range of audiences that run the gamut in age, socioeconomic status, language and literacy, education level, cultural background, and interests. An education nonprofit, for example, might be writing for donors, teachers, parents and caregivers, and students, each of which is a hugely diverse population in itself.
It has to balance a variety of genres ranging from the loftily abstract (mission statements, vision statements, value statements) to the pragmatic (statistics, how to sign up for a program). When making a persuasive argument (an op-ed, a position statement, a fundraising campaign), it draws upon moral claims, experience and observation, research and theory, or a combination of the above.
It has an existential imperative to tell stories in a way that doesn’t perpetuate harm, whether that’s in characterizing people and communities, presenting the problems, or choosing language. Plus, what we understand as harm is ever evolving, and our writing must keep up with that evolution. Some organizations also have an existential imperative to use storytelling to shift public perceptions. Fulfilling these imperatives requires nuance, which can’t be resolved with an equity language checklist. Grasping this kind of nuance takes a ton of time, intention, and contextual understanding, and humans often fail at it; I certainly do.
Most people working in the sector have neither the time nor the expertise to navigate these three challenges, which most often means that their storytelling is all over the place. The stories that the executive director, the development director, and the volunteer manager each tell might suggest three different problem statements, reflect three different organizational cultures and styles, and convey three different understandings of what their values mean in the social context in which they operate. This is where I come in.
I work with mission-driven organizations to develop narrative frameworks, which define their audiences, nail down their organizational story, and codify their organizational language and voice to reflect their values and personality. I then train staff in using the framework so that any communications materials they generate going forward are compelling, consistent, and characteristic.
So I wondered: what would happen if I trained AI in using a client’s narrative framework?
I pitched the idea of the experiment to Vero, the founder of an amazing edtech startup I’d worked with a couple of years before. She agreed. Bilingual Generation felt like the perfect testing ground: as an organization seeking to preserve heritage languages across generations and advance social justice education in multilingual and multicultural ways, it takes narrative nuance very seriously. And as an organization with some phenomenal human storytellers on its team, it could set a high bar for AI.
Here’s what happened.
Process
I adapted Bilingual Generation’s narrative framework into three AI training guides, each focused on a different medium: social media, the Bili learning app blog, and longform thought leadership. Together, the guides ran to 40 pages, comprising a primer on Bilingual Generation, exemplars, and detailed instructions such as: “Intermittently incorporate Spanish words and phrases into English sentences, without ever writing full sentences in Spanish, e.g. ‘Me siento muy excited.’”
I then used the guides to train three different AI language models (a quick code sketch below shows what “training” looked like in practice):
OpenAI’s GPT-4
Google’s Gemini Advanced
Anthropic’s Claude 3
I also trained an additional platform, Jasper, which as of this writing runs on OpenAI’s GPT-3.5 language model but has been configured for marketing and communications purposes.
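A quick note on mechanics: “training” here means priming each model with the relevant guide as standing context in a conversation, not fine-tuning its weights. For the technically curious, here is a minimal sketch of what that looks like against OpenAI’s chat API, assuming the guide lives in a local text file; the file name and plumbing are my own illustration, and the user prompt is one of the actual test prompts from this experiment:

```python
from pathlib import Path

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Load the adapted narrative framework; the file name is illustrative.
guide = Path("bg_social_media_guide.txt").read_text()

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        # The guide rides along as a system message, so every generation
        # is grounded in the organization's story, voice, and language rules.
        {"role": "system", "content": guide},
        {
            "role": "user",
            "content": (
                "Generate an Instagram post featuring 5-8 children's Spanish "
                "language books by Latinx authors celebrating LGBTQIA identity."
            ),
        },
    ],
)

print(response.choices[0].message.content)
```

In the chat interfaces themselves, the equivalent move is simply pasting the guide into the conversation before issuing any prompts.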
We used five training prompts to refine the guides. After each prompt, Vero and the Bilingual Generation team reviewed the content that the AI models generated, then provided specific and detailed feedback. I used the feedback to revise the guides before running the next training prompt.
After iterating and revising through the five training prompts, we administered three test prompts:
Generate an Instagram post featuring 5-8 children's Spanish language books by Latinx authors celebrating LGBTQIA identity.
Write a blog post about what gender inclusive Spanish is, how it functions, and why people use it. Include reflections on whether gender inclusive Spanish is appropriate for use with young children. If so, how.
Write an essay that dives into the intersections of identity, language, power, and politics. This essay should take an historical look at the legacy of linguistic oppression and elevate the role of raciolinguistics in impacting children’s sense of belonging in school.
We then compared the results from the four models against each other, and against pieces written by a human writer from the Bilingual Generation team. I ranked each model in four categories on a scale of 1 to 5, with 1 being the least strong, 5 the strongest, and 3 meh.
Story alignment
Did the generated content not only follow the prompt instructions but also align with Bilingual Generation’s overall organizational story?
For example, although the Instagram prompt did not mention bilingual education, the training guides made it clear that all content should reinforce Bilingual Generation’s mission. The posts generated by GPT-4, Gemini, and Claude all emphasized the importance of bilingual education and heritage languages, while Jasper failed to mention either at all.
Another key tenet of Bilingual Generation’s story is acknowledging and understanding linguistic trauma in communities. Vero, as founder, talks openly about the shame she felt as a child over not knowing how to speak Spanish. GPT-4’s use of the hashtag #bilingualisbetter in its response perpetuates exactly that culture of shame.
Voice/style consistency
Did the generated content sound like it was speaking and writing in Bilingual Generation’s voice? In other words, did it feel playful, approachable, and passionate while coming off as reflective and expert? Was the style succinct and to the point?
The results were a mixed bag. Gemini hit the nail on the head for social media but tanked on the blog and the thought piece. Claude captured the voice quite effectively in longer-form pieces. GPT-4 and Jasper both completely missed the mark.
Despite multiple examples and explicit instructions to the contrary—e.g. “use simple sentences and fragments” and “avoid flowery language”—GPT-4 insisted on being verbose. It would not stop with the alliteration or the metaphors. Wharton professor and AI expert Ethan Mollick once said that GPT-4 “loves to talk about rich tapestries,” and true to form, GPT-4 began the blog post with this line: “In the vibrant tapestry of global cultures, language stands as a powerful tool for expression, identity, and inclusivity.”
Jasper also leaned toward the overblown, with lines like “we’re celebrating love, diversity, and the beautiful spectrum of identities” for Instagram and “May this chapter serve as a guiding star, illuminating the path toward a future where every student can confidently say, ‘Here, I stand; I am understood, I am valued, and I belong.’” As for its longform products, rather than sounding expert and passionate, Jasper contrived to produce formulaic essays reminiscent of middle or high school writing, complete with topic sentences and a concluding paragraph that begins with… “in conclusion.”
Use of Spanish
Did the generated content effectively incorporate Spanish words?
Translanguaging, in which multilingual speakers strategically employ multiple languages to communicate, is core to both Bilingual Generation’s story and voice, so I felt it was worth assessing Spanish use in its own category. Gemini was pretty good in this category, but the other three platforms struggled. GPT-4 and Jasper insisted on either writing all the content in Spanish, or alternating between Spanish and English for full sentences, and needed to be corrected with follow-up prompts. And while Claude understood the assignment, Vero found that its cadence was off; the translanguaging “was a little difficult to read—it felt like too much switching at odd moments and was hard for me to read.”
Content accuracy
Was the generated content accurate in the information it conveyed?
All four models struggled in this category. For example, none of them produced a list of children’s books in the Instagram post that fulfilled all the criteria (celebrating LGBTQIA identity, written in Spanish, by Latinx authors).
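For anyone who wants to replicate this evaluation, here is a minimal sketch of the rubric as a data structure; the four categories and the 1-to-5 scale come from this article, but the Python class itself is my own illustration, and no actual scores appear in it:

```python
from dataclasses import dataclass

# The four evaluation categories used in this experiment.
CATEGORIES = (
    "story alignment",
    "voice/style consistency",
    "use of Spanish",
    "content accuracy",
)

@dataclass
class Scorecard:
    """One model's ratings, each from 1 (least strong) to 5 (strongest)."""
    model: str
    scores: dict[str, int]  # category -> rating

    def __post_init__(self) -> None:
        for category, rating in self.scores.items():
            if category not in CATEGORIES:
                raise ValueError(f"unknown category: {category}")
            if not 1 <= rating <= 5:
                raise ValueError(f"rating must be 1-5, got {rating}")

    def average(self) -> float:
        """Average rating across the scored categories."""
        return sum(self.scores.values()) / len(self.scores)
```

A spreadsheet does the same job, of course; the point is just that fixing the rubric up front keeps all four models honest against the same yardstick.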
Rankings are summarized below.
Long story short
I do think that, for another organization with a less nuanced approach to storytelling, the scores would probably have been higher. But also, I’d argue that organizations in the social sector should strive for as much nuance as possible!
Overall, neither Vero nor I was blown away by any of the AI-generated writing. However, we were impressed by Claude’s ability to grasp longer-form writing, and Gemini often hit the mark on the social media front. All four platforms added value in helping us understand which aspects of story can be formulaic, aspects that can then be codified for efficient storytelling, avoided for original storytelling, or both. For that purpose, though, the free versions of the platforms (other than Jasper, which doesn’t offer one) are sufficient.
At this point, Claude is the only model that I would feel comfortable recommending to any client as worth the monthly subscription fee. That said, it still requires a human hand for writing a clear and specific prompt and a human eye for fact-checking and revision.