Creating a Network Graph from the Cairo Geniza

The Data

I started with the Princeton Geniza Project's dataset, built from research notecards transcribed and made available online by the Princeton Geniza Lab. Each entry includes the document's PGPID (a unique sequential ID), a written description, and a document type (letter, legal, ketubah, list, etc.).

At last count, the dataset contained over 35,000 documents, including roughly 11,000 letters and 8,000 legal documents. A corpus that large called for automated processing.

Data Cleaning and Name Extraction

Initial Approach: Named Entity Recognition

My first attempt at extracting names used named entity recognition (NER). However, standard NER tools struggled with patronymic-based names like "Nahray ben Nissim", often misidentifying parts of the name as separate entities.

Refined Method: Regular Expressions

Fortunately, Goitein's consistent note-taking style for letters allowed for a simpler approach using regular expressions. The general format was:

"Letter from [sender] to [recipient]"

This let me extract senders and recipients using regex patterns:

  • Sender: Text between "from" and "to"
  • Recipient: Text after "to" until the next punctuation mark

This worked perfectly in some cases, but numerous edge cases meant that many letters still required manual cleaning.
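The two extraction rules above can be sketched as a single pattern. This is a simplified sketch, not the exact patterns used; in particular, stopping the recipient at the next punctuation mark will truncate names containing the abbreviation "b.", which is one of the edge cases mentioned:

```python
import re

# Simplified sketch of the rules above: sender sits between "from" and
# "to"; recipient runs until the next period, comma, or semicolon.
LETTER_RE = re.compile(
    r"Letter from (?P<sender>.+?) to (?P<recipient>[^.,;]+)",
    re.IGNORECASE,
)

def extract_parties(description):
    """Return (sender, recipient) from a Goitein-style description, or None."""
    m = LETTER_RE.search(description)
    if m is None:
        return None
    return m.group("sender").strip(), m.group("recipient").strip()
```

For example, `extract_parties("Letter from Nahray ben Nissim to Barhun ben Ishaq, Fustat.")` yields `("Nahray ben Nissim", "Barhun ben Ishaq")`, while non-letter descriptions return `None`.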

Using an LLM to Get the Data

I experimented with using a large language model (LLM) to extract letter recipients and other metadata from the descriptions. However, this approach had mixed results and still required substantial manual cleaning and standardization.

I eventually used a combination of LLM and regex extraction: Groq's hosted Mixtral 8x7B, called through Instructor to return the names, locations, and other metadata as structured output, followed by regex-based cleanup.

This worked pretty well! It was fairly fast and the results were generally good, especially with two- or three-shot prompts. I still needed a few rounds of manual cleaning afterward, but it saved a lot of time over doing everything by hand.
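The shape of that few-shot setup can be sketched as follows. The schema fields and the example descriptions below are hypothetical stand-ins; the real calls went through Instructor's structured-output wrapper over Groq's Mixtral 8x7B, which is omitted here so the sketch stays self-contained:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LetterMetadata:
    """Target extraction schema (hypothetical field names)."""
    sender: Optional[str]
    recipient: Optional[str]
    location: Optional[str]

# Hypothetical few-shot examples; the actual prompts and model calls
# went through Instructor over Groq and are not reproduced here.
FEW_SHOT = [
    (
        "Letter from Nahray ben Nissim to Barhun ben Ishaq, Fustat.",
        '{"sender": "Nahray ben Nissim", "recipient": "Barhun ben Ishaq", "location": "Fustat"}',
    ),
    (
        "Letter from a woman to her brother.",
        '{"sender": null, "recipient": null, "location": null}',
    ),
]

def build_prompt(description):
    """Assemble a two-shot extraction prompt for one description."""
    parts = ["Extract sender, recipient, and location as JSON."]
    for example, answer in FEW_SHOT:
        parts.append(f"Description: {example}\nJSON: {answer}")
    parts.append(f"Description: {description}\nJSON:")
    return "\n\n".join(parts)
```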

However, the LLM had no way of knowing that "Shlomo b. Avraham" and "Shlomo b. Avraham ha-Kohen" were actually the same person, so I still needed to do a significant amount of cleaning by hand.

Manual Cleaning and Standardization

The process of manual cleaning involved:

  • Reviewing thousands of rows to ensure name consistency
  • Standardizing name spellings and honorifics
  • Identifying when different spellings referred to the same person

I used fuzzy name matching to find similar names, but this had limited success. Mostly, I had to go name by name, find similar ones, and check their original notes to determine if they were the same person or different people.
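A minimal sketch of such a fuzzy pass, using Python's standard-library difflib (assumed here for illustration; the exact matching library isn't specified above):

```python
from difflib import get_close_matches

def candidate_matches(name, known_names, cutoff=0.85):
    """Suggest existing names that may be variant spellings of `name`.

    Returns up to five candidates whose similarity ratio meets `cutoff`;
    each suggestion still needs a manual check against the original notes.
    """
    return get_close_matches(name, known_names, n=5, cutoff=cutoff)
```

For example, `candidate_matches("Avraha", ["Avraham", "Aharon", "David"])` returns `["Avraham"]`, flagging a likely variant spelling while leaving dissimilar names alone.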

Creating the Network Graph

After standardizing the names, I used d3.js to build a network graph of letter senders and recipients. With help from Ben Johnson at the Princeton Geniza Lab, I created a small website to host the graph, later adding functionality to view letters by clicking on nodes (individuals) or links (connections between individuals).
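The cleaned sender/recipient pairs have to be shaped into the nodes/links JSON that d3.js force layouts conventionally consume. A sketch of that transformation, with field names assumed rather than taken from the actual site code:

```python
from collections import Counter

def to_d3_graph(pairs):
    """Turn (sender, recipient) tuples into a d3-style graph dict.

    Each person becomes a node; each sender->recipient pair becomes a
    link whose "value" counts how many letters passed between them.
    """
    weights = Counter(pairs)
    names = sorted({name for pair in weights for name in pair})
    nodes = [{"id": name} for name in names]
    links = [
        {"source": s, "target": t, "value": count}
        for (s, t), count in sorted(weights.items())
    ]
    return {"nodes": nodes, "links": links}
```

Serializing the result with `json.dumps` gives a file a d3 force-directed layout can load directly.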

Name Standardization Reference

Here's a reference list of standardized names and their variations that I used during the cleaning process:


{
    "aharon": ["aaron"],
    "avon": ["abon"],
    "avraham": ["abraham", "avraha", "ibrahim"],
    "ʿazarya": ["azaraiah", "azarya"],
    "binyamin": ["benjamin"],
    "boʿaz": ["boaz", "bo'az"],
    "baruch": ["barukh", "baruk"],
    "daniʾel": ["daniel", "dani'el"],
    "david": ["daud", "dawud"],
    // ... (rest of the list)
    "tripoli": ["ha-itrabulusi", "ha-atrabulsi", "ha-itrabulsi"]
}

Note: Most of these notes were written by S.D. Goitein over 35 years of research, during which he created more than 35,000 index cards on the Cairo Geniza; about 9,000 of them describe letters. Each index card covers a single fragment from the Geniza, and together they form the basis of the network graph I created.