BEHIND / THE / SCENES

I.
Measuring Fragment Similarity

First, I needed a way to measure how similar fragments were to each other. For simplicity, I decided to use Jaccard similarity, which scores two documents by the number of distinct words they share divided by the number of distinct words that appear in either one. I learned this measure in my Web Systems class as one way search engines look for relevant documents, which made it easier to understand and implement for this project!

It is important to note the limitations of this method: it doesn't account for word frequency (how many times a word is repeated in a fragment) or lexical similarity (for example, gold and goldsandaled are treated as completely unrelated words).
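
As a rough illustration of the measure and its two limitations, here is a minimal Python sketch. It assumes fragments are split on whitespace and lowercased, which may not match the project's actual tokenization; the function name and the sample fragments are invented for this example.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity of two texts, each treated as a set of lowercase words."""
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    union = words_a | words_b
    if not union:
        return 0.0  # assumption: two empty fragments are scored as dissimilar
    return len(words_a & words_b) / len(union)


# Hypothetical fragments, made up for illustration only.
repeated = jaccard_similarity("gold dawn", "gold dawn dawn")
related = jaccard_similarity("gold dawn", "goldsandaled dawn")

print(repeated)  # the repeated word changes nothing: both sets are identical
print(related)   # only "dawn" is shared; "gold" and "goldsandaled" never match
```

Because each fragment is reduced to a set, repeating a word has no effect on the score, and two words only count as shared when they match exactly.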

II.
Categorizing Fragments

Next, I needed to categorize the fragments based on the Jaccard similarity scores between each pair of fragments. Taking inspiration from my internship in computational biology, where I wrote scripts to categorize chromosome shapes, I calculated a pairwise matrix of Jaccard similarities.
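
A pairwise matrix like this can be sketched in a few lines of Python. This is only an illustration under assumptions: the whitespace tokenization, the function names, and the three sample fragments are all hypothetical and not taken from the project.

```python
def jaccard(a: set, b: set) -> float:
    """Shared words divided by total distinct words across both sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_matrix(fragments: list[str]) -> list[list[float]]:
    """Return an n x n matrix where entry [i][j] is the similarity of fragments i and j."""
    word_sets = [set(f.lower().split()) for f in fragments]
    n = len(word_sets)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            score = jaccard(word_sets[i], word_sets[j])
            # The measure is symmetric, so fill both halves at once.
            matrix[i][j] = matrix[j][i] = score
    return matrix


# Hypothetical fragments, made up for illustration only.
fragments = ["gold dawn", "gold sea", "dark sea"]
matrix = pairwise_matrix(fragments)
for row in matrix:
    print([round(score, 2) for score in row])
```

Each row describes one fragment's similarity to every other fragment, and the diagonal is 1.0 because every non-empty fragment is identical to itself.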

Then, I fed this matrix into a hierarchical agglomerative clustering algorithm, which starts with every fragment in its own group and iteratively merges the closest groups based on "distance". The distance in this case comes from the Jaccard similarity score: the more words two fragments share, the closer together they are.
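
To make the merging step concrete, here is a small pure-Python sketch of agglomerative clustering. It rests on several assumptions that the write-up does not specify: distance is taken as 1 minus the similarity, groups are compared by average linkage, and merging stops at an arbitrary threshold of 0.5. The similarity values are invented for the example.

```python
def agglomerative(distance: list[list[float]], threshold: float) -> list[list[int]]:
    """Repeatedly merge the two closest clusters until none are within `threshold`."""
    clusters = [[i] for i in range(len(distance))]  # every item starts alone

    def average_link(c1: list[int], c2: list[int]) -> float:
        # Average distance over all cross-cluster pairs (average linkage).
        return sum(distance[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best, a, b = min(
            (average_link(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best > threshold:
            break  # remaining clusters are too far apart to merge
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical similarity matrix for four fragments (illustrative values only).
similarity = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.7],
    [0.0, 0.1, 0.7, 1.0],
]
# Assumption: convert similarity to distance so that similar fragments are close.
distance = [[1.0 - s for s in row] for row in similarity]

groups = agglomerative(distance, threshold=0.5)
print(groups)  # [[0, 1], [2, 3]]
```

In this toy input, fragments 0 and 1 merge first, then 2 and 3, and the two resulting groups stay separate because their average distance is above the threshold. A library implementation would be the more usual choice for real data.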

III.
Visualizing with Circle Packing

Finally, once I had all of my groups, all I had to do was visualize the data! I accomplished this with D3's hierarchy library, which places circular nodes in 2D space from largest to smallest (to fill up as much space as possible). This created a much more organized and visually pleasing display than placing fragment groups manually or at random. It also lets smaller fragment groups take up less space, rather than being spaced the same distance apart as larger groups.
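
For readers curious about the data side of this step, the sketch below shows one way clustered groups could be turned into the nested structure that D3's hierarchy layouts read by default (objects with a `children` array). It is written in Python purely as an illustration; the group names, fragment labels, and the `name` and `value` field names are assumptions, not the project's actual data format.

```python
import json


def to_hierarchy(groups: dict[str, list[str]]) -> dict:
    """Nest groups and their fragments under a single root node."""
    return {
        "name": "fragments",
        "children": [
            {
                "name": group_name,
                # Assumption: each fragment is a leaf with equal weight.
                "children": [{"name": frag, "value": 1} for frag in members],
            }
            for group_name, members in groups.items()
        ],
    }


# Hypothetical groups, made up for illustration only.
groups = {
    "group-1": ["fragment 1", "fragment 2", "fragment 3"],
    "group-2": ["fragment 4"],
}
tree = to_hierarchy(groups)
print(json.dumps(tree, indent=2))
```

On the D3 side, a structure like this would typically be passed to `d3.hierarchy`, summed by the leaf weights, and handed to `d3.pack`, so a group with three fragments ends up in a larger circle than a group with one.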