Cairo Geniza Pet Projects

Published on: July 2, 2024

By: Joshua Spergel

I'm excited to share updates on two ongoing projects related to the Cairo Geniza.

Project 1: Enhancing the Geniza Network Graph

In college, I created a letter network of Cairo Geniza documents from the 11th century. While I'm proud of this work, I've long recognized its limitations and have been eager to expand its scope. Here are the key improvements I'm working on:

Expanding the time range: The original graph only covered the 11th century. The new version will span from the 9th to the 17th century, encompassing the full range of the Cairo Geniza documents.
Including all document types: Previously, I only used letters. Now, I'm incorporating data from all types of documents found in the Geniza.
Enhancing metadata: The new graph will include additional information such as the starting and ending locations of letters and their precise dating.
Improving the user interface: I'm exploring better ways to visualize and interact with the data, possibly using tools like Palladio.
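
To make the planned changes concrete, here is a minimal sketch of how documents with the extra metadata might be represented and grouped into graph edges. It is not the project's actual schema: the Document fields, the build_graph and filter_by_period helpers, and the placeholder people and places are all assumptions made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Document:
    # Every field name here is a hypothetical choice, not the project's schema.
    doc_id: str
    doc_type: str                      # e.g. "letter", "legal document", "list"
    sender: str
    recipient: str
    origin: Optional[str] = None       # starting location, if known
    destination: Optional[str] = None  # ending location, if known
    year: Optional[int] = None         # approximate date (CE), if known


def build_graph(documents):
    """Group documents by (sender, recipient) pair; each group is one edge."""
    edges = defaultdict(list)
    for doc in documents:
        edges[(doc.sender, doc.recipient)].append(doc)
    return edges


def filter_by_period(documents, start_year, end_year):
    """Keep documents dated within [start_year, end_year]; undated ones are dropped."""
    return [
        d for d in documents
        if d.year is not None and start_year <= d.year <= end_year
    ]


# Placeholder records, invented purely to exercise the functions.
sample = [
    Document("d1", "letter", "Person A", "Person B", "City X", "City Y", 1050),
    Document("d2", "legal document", "Person A", "Person B", year=1120),
    Document("d3", "letter", "Person B", "Person C"),
]

graph = build_graph(sample)
for (sender, recipient), docs in graph.items():
    print(sender, "->", recipient, "documents:", len(docs))
```

Keeping the full document records on each edge, rather than just a count, means the same structure can later be filtered by date range, document type, or location before it is handed to a visualization tool.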

Thanks to advances in AI and large language models, extracting information from these documents has become much more efficient. I'm currently using DSPy to process a dataset about 25 times larger than the original, though it includes many one-off names and potential duplicates that still need cleaning.
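
One possible first pass at the duplicate-name cleanup is sketched below. It uses only the Python standard library and is not necessarily the approach I'll end up with: the normalize and candidate_duplicates helpers, the 0.85 threshold, and the sample names are all assumptions chosen for illustration.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, strip diacritics, and collapse runs of whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_marks.lower().split())


def candidate_duplicates(names, threshold=0.85):
    """Return (name_a, name_b, score) for pairs whose normalized forms look alike.

    The threshold is an arbitrary starting point and would need tuning
    against names already known to refer to the same person.
    """
    normalized = [(n, normalize(n)) for n in names]
    pairs = []
    for i in range(len(normalized)):
        for j in range(i + 1, len(normalized)):
            original_a, norm_a = normalized[i]
            original_b, norm_b = normalized[j]
            score = SequenceMatcher(None, norm_a, norm_b).ratio()
            if score >= threshold:
                pairs.append((original_a, original_b, round(score, 2)))
    return pairs


# Invented example spellings, not taken from the dataset.
names = ["Yūsuf ibn Ya'qub", "Yusuf ibn Yaqub", "Person Z"]
for a, b, score in candidate_duplicates(names):
    print(a, "~", b, score)
```

Comparing every pair is quadratic in the number of names, so a larger dataset would likely need some grouping step first (for example, by first letter or by location). Any flagged pair is only a candidate for human review, since two different people can share very similar names.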

I look forward to sharing a detailed blog post about my full process in the near future.

Project 2: Machine-Assisted Translations

The Princeton Geniza Project has transcriptions of 6,000 Cairo Geniza text documents, but only 600 have been translated into English. To address this gap, I'm experimenting with using large language models for translation, focusing primarily on documents originally written in Judeo-Arabic.

So far, I've machine-translated nearly all of these documents and compared the output to existing human translations using semantic similarity scores and BLEU scores. The results range from human-quality translations to significant divergences. I'm currently exploring ways to improve translation quality, with a particular focus on using DSPy to optimize my prompts.
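
For readers unfamiliar with BLEU, the sketch below shows the idea behind the score: overlapping word n-grams between a machine translation and a human reference, with a penalty for output that is too short. It is a simplified, standard-library-only version written for illustration, not the scoring code I actually use. The simple_bleu function, its whitespace tokenization, and its add-one smoothing are assumptions, so its numbers will not match established tools such as sacreBLEU or NLTK. Semantic similarity is left out because it depends on an embedding model.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU against a single reference.

    Uses whitespace tokenization and add-one smoothing on each n-gram
    precision, so scores differ from standard BLEU implementations.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0

    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(
            min(count, ref_counts[gram]) for gram, count in cand_counts.items()
        )
        total = sum(cand_counts.values())
        log_precision_sum += math.log((overlap + 1) / (total + 1))

    # Brevity penalty: candidates shorter than the reference are scaled down.
    if len(cand) >= len(ref):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_precision_sum / max_n)


# Invented English sentences, used only to show the call.
human = "the shipment of flax arrived safely last week"
machine = "the flax shipment arrived safely last week"
print(round(simple_bleu(machine, human), 3))
```

Because BLEU rewards exact word overlap, a faithful translation that uses different wording can still score poorly, which is one reason to pair it with a semantic similarity measure.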

I'm eager to share more detailed updates as these projects progress. Stay tuned for future posts where I'll dive deeper into the methodologies and challenges of each one.

Note: These projects are ongoing and experimental. While the results are promising, they are not yet ready for academic use without human verification.