Intertextuality Analysis of Medieval Scholastic Texts using Ngrams and Graphs

Jeffrey C. Witt (Loyola University Maryland)
https://jeffreycwitt.com | jcwitt@loyola.edu
@jeffreycwitt

April 12, 2023, Institute for Medieval Research, Austrian Academy of Sciences

Slide Deck: http://jeffreycwitt.com/slides/2023-04-12-text-reuse

https://creativecommons.org/licenses/by-nc-sa/4.0/

## Outline

1. Similarity
2. Visualization
3. Cluster Detection

## Similarity

Paragraph N-Gram Frequency Feature Vector

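| Paragraph | 4gram1 | 4gram2 | 4gram3 | 4gram4 | 4gram5 | 4gram6 | 4gram7 | 4gram8 | 4gram9 | 4gram10 | 4gram11 | 4gram12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Doc A | 2 | 3 | 0 | 2 | 0 | 5 | 1 | 2 | 0 | 0 | 3 | 1 |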

$$ \cos(\theta) = {\mathbf{A} \cdot \mathbf{B} \over \|\mathbf{A}\| \|\mathbf{B}\|} = \frac{ \sum\limits_{i=1}^{n}{A_i B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{A_i^2}} \sqrt{\sum\limits_{i=1}^{n}{B_i^2}} } $$
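As a minimal sketch of the frequency-vector comparison, the Python below counts word 4-grams in two paragraphs and applies the cosine formula above. The helper names (`ngrams`, `cosine_similarity`), the whitespace tokenizer, and the two sample sentences are illustrative assumptions, not the pipeline used for the corpus.

```python
from collections import Counter
from math import sqrt


def ngrams(text, n=4):
    """Return the word n-grams of `text` (naive lowercase whitespace tokenization)."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def cosine_similarity(freq_a, freq_b):
    """Cosine of the angle between two sparse n-gram frequency vectors (Counters)."""
    dot = sum(count * freq_b[gram] for gram, count in freq_a.items())
    norm_a = sqrt(sum(c * c for c in freq_a.values()))
    norm_b = sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a paragraph too short to yield any n-gram matches nothing
    return dot / (norm_a * norm_b)


# Hypothetical sample paragraphs, for illustration only.
para_a = "utrum deus sit primum obiectum intellectus nostri"
para_b = "quaeritur utrum deus sit primum obiectum voluntatis"

vec_a = Counter(ngrams(para_a))
vec_b = Counter(ngrams(para_b))
print(round(cosine_similarity(vec_a, vec_b), 3))
```

Storing the vectors as `Counter` objects keeps them sparse: only n-grams that actually occur in a paragraph take up space, which matters when the n-gram vocabulary is large.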

Paragraph N-Gram Presence Feature Vector

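| Paragraph | 4gram1 | 4gram2 | 4gram3 | 4gram4 | 4gram5 | 4gram6 | 4gram7 | 4gram8 | 4gram9 | 4gram10 | 4gram11 | 4gram12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Doc A | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 |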

Document Vectors A * B = Dot Product Vector = Intersection Vector (V)

$$ \text{Intersection Vector Summation} = \sum\limits_{i=1}^{n}{A_i B_i} $$

If Intersection Vector Summation >= 6, then Doc Vectors A and B are "similar".

"Bag of N-Grams Model"

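| Paragraph | 4gram1 | 4gram2 | 4gram3 | 4gram4 | 4gram5 | 4gram6 | 4gram7 | 4gram8 | 4gram9 | 4gram10 | 4gram11 | 4gram12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Doc A | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 |
| Doc B | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| Dot Product Vector | 1x1=1 | 1x1=1 | 0x1=0 | 1x0=0 | 0x1=0 | 1x1=1 | 1x1=1 | 1x1=1 | 0x0=0 | 0x1=0 | 1x1=1 | 1x1=1 |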
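The table can be reproduced in a few lines of Python. This is a sketch under the stated threshold of 6; the names `intersection_vector` and `is_similar` are illustrative, and the two vectors are copied from the Doc A and Doc B rows.

```python
# Presence vectors for Doc A and Doc B, copied from the table above.
doc_a = [1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1]
doc_b = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1]

THRESHOLD = 6  # minimum number of shared 4-grams, as stated above


def intersection_vector(a, b):
    """Element-wise product: 1 only where both paragraphs contain the n-gram."""
    return [x * y for x, y in zip(a, b)]


def is_similar(a, b, threshold=THRESHOLD):
    """True when the two presence vectors share at least `threshold` n-grams."""
    return sum(intersection_vector(a, b)) >= threshold


shared = intersection_vector(doc_a, doc_b)
print(shared)
print(sum(shared), is_similar(doc_a, doc_b))
```

Because the vectors are binary, the dot product is simply the number of 4-grams the two paragraphs have in common: 7 here, so the pair clears the threshold.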

![intersection](https://s3.amazonaws.com/lum-faculty-jcwitt-public/2023-02-01/image5.png)

Similarity = X is similar to Y, if and only if

$$ \#\{\, a \mid \forall{ng}\forall{x}\forall{y}(\mathit{IsFoundIn}(ng,x) \land \mathit{IsFoundIn}(ng,y) \land x \neq y) \,\} \geq n $$

where n = 6
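The same criterion can be read as a set-cardinality test: two distinct passages are similar when their sets of 4-grams share at least n members. The sketch below assumes passages arrive as token lists keyed by an identifier; `ngram_set`, `similar_pairs`, and the toy passages are hypothetical names and data.

```python
from itertools import combinations


def ngram_set(tokens, n=4):
    """Set of word n-grams occurring in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def similar_pairs(passages, threshold=6, n=4):
    """Return the pairs of distinct passage ids sharing at least `threshold` n-grams."""
    grams = {pid: ngram_set(tokens, n) for pid, tokens in passages.items()}
    return {
        (x, y)
        for x, y in combinations(sorted(grams), 2)  # x != y by construction
        if len(grams[x] & grams[y]) >= threshold
    }


# Toy data: a run of nine shared tokens yields six shared 4-grams.
common = "a b c d e f g h i".split()
passages = {
    "p1": ["x0"] + common,
    "p2": common + ["y0"],
    "p3": common[:8],  # only five 4-grams in total, so below the threshold
}
print(similar_pairs(passages))
```

Iterating over `combinations` enforces the x ≠ y condition and avoids testing each pair twice.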

## Visualization

### SPARQL Query Graph

![](https://s3.amazonaws.com/lum-faculty-jcwitt-public/2023-04-12-vienna/sparqlQuery.png)

## Cluster Detection

Successive Passage Similarity =

$$ \forall x_n \forall y_m \big( R(x_n, y_m) \land R(x_{n+1}, y_{m+1}) \land R(x_{n+2}, y_{m+2}) \big) $$
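One way to check this condition in code is to treat the relation R as a set of index pairs (n, m), meaning paragraph n of one text is similar to paragraph m of another, and to look for three consecutive diagonal steps. The sketch assumes paragraphs are numbered by their position in each text; `successive_runs` and the sample pairs are hypothetical.

```python
def successive_runs(similar, length=3):
    """Return the starting pairs (n, m) of diagonal runs of `length` similar pairs.

    `similar` is a set of (n, m) tuples standing in for the relation R.
    """
    return sorted(
        (n, m)
        for (n, m) in similar
        if all((n + k, m + k) in similar for k in range(length))
    )


# Hypothetical relation: three successive matches plus one isolated match.
R = {(4, 8), (5, 9), (6, 10), (2, 12)}
print(successive_runs(R))
```

An isolated match such as (2, 12) is ignored, because its successors (3, 13) and (4, 14) are not in the relation.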

| |5|6|7|8|9|10|11|12|
|---|---|---|---|---|---|---|---|---|
|2|0|0|0|0|0|0|0|0|
|3|0|0|**0**|**0**|**0**|**0**|0|0|
|4|0|0|**0**|**1**|**0**|**0**|0|0|
|5|0|0|**0**|**0**|**1**|**0**|0|0|
|6|0|0|**0**|**0**|**0**|**1**|0|0|
|7|0|0|0|0|0|0|0|0|
|8|0|0|0|0|0|0|0|0|
|9|0|0|0|0|0|0|0|0|

| | | | |
|---|---|---|---|
|1|0|0|0|
|0|1|0|0|
|0|0|1|0|
|0|0|0|1|

| | | | |
|---|---|---|---|
|0|0|0|0|
|0|1|0|0|
|0|0|1|0|
|0|0|0|1|

| |5|6|7|8|9|10|11|12|
|---|---|---|---|---|---|---|---|---|
|2|0|0|0|0|0|0|0|0|
|3|0|0|**1**|**0**|**0**|**0**|0|0|
|4|0|0|**0**|**0**|**0**|**0**|0|0|
|5|0|0|**0**|**0**|**1**|**0**|0|0|
|6|0|0|**0**|**1**|**0**|**1**|0|0|
|7|0|0|0|0|0|0|0|0|
|8|0|0|0|0|0|0|0|0|
|9|0|0|0|0|0|0|0|0|

| | | | |
|---|---|---|---|
|1|0|0|0|
|0|1|0|0|
|0|0|1|0|
|0|0|0|1|

| | | | |
|---|---|---|---|
|1|0|0|0|
|0|0|0|0|
|0|0|1|0|
|0|0|0|1|
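The tables above can be read as a paragraph-by-paragraph similarity matrix, a 4x4 identity kernel, and the element-wise product of the kernel with the bolded window. On that assumption, the sketch below reproduces both examples in plain Python. The function names, and the reading of rows and columns as paragraph numbers in two texts, are illustrative rather than taken from the original implementation.

```python
ROWS = list(range(2, 10))  # row labels (assumed: paragraph numbers in one text)
COLS = list(range(5, 13))  # column labels (assumed: paragraph numbers in the other)


def build_matrix(hits):
    """Binary similarity matrix with a 1 at every (row label, column label) in `hits`."""
    return [[1 if (r, c) in hits else 0 for c in COLS] for r in ROWS]


def masked_window(matrix, top, left, size=4):
    """Element-wise product of a size x size window with the identity kernel."""
    return [
        [matrix[top + i][left + j] if i == j else 0 for j in range(size)]
        for i in range(size)
    ]


def diagonal_score(matrix, top, left, size=4):
    """Convolution value at (top, left): the sum of the masked window."""
    return sum(sum(row) for row in masked_window(matrix, top, left, size))


first = build_matrix({(4, 8), (5, 9), (6, 10)})
second = build_matrix({(3, 7), (5, 9), (6, 8), (6, 10)})

# Position of the bolded window: rows 3-6, columns 7-10.
top, left = ROWS.index(3), COLS.index(7)

print(masked_window(first, top, left), diagonal_score(first, top, left))
print(masked_window(second, top, left), diagonal_score(second, top, left))
```

In the second example the off-diagonal hit at row 6, column 8 is zeroed out by the kernel, so only matches that preserve paragraph order contribute to the score. Sliding `top` and `left` over every valid position gives the full convolution.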

SubstantialSuccessiveTextSimilarity(t) =

$$ \#\{\, a \mid \forall{t}(\mathit{SuccessiveReuse}(t)) \,\} \geq 10 $$

where t = Question or Chapter
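In code, this threshold amounts to counting the instances of successive reuse found within each unit t and keeping the units that reach 10. The sketch assumes the counts have already been computed per question or chapter; `substantial_units` and the unit identifiers are hypothetical.

```python
def substantial_units(successive_reuse_counts, threshold=10):
    """Return the ids of units with at least `threshold` successive-reuse instances."""
    return sorted(
        unit
        for unit, count in successive_reuse_counts.items()
        if count >= threshold
    )


# Hypothetical counts of successive-reuse instances per question or chapter.
counts = {"question-1": 12, "question-2": 3, "chapter-7": 10}
print(substantial_units(counts))
```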

### Doc Embedding

### Doc2Vec

`model = gensim.models.doc2vec.Doc2Vec(vector_size=200, min_count=10, epochs=100)`

See [Quoc Le and Tomas Mikolov, "Distributed Representations of Sentences and Documents", *Proceedings of the 31st International Conference on Machine Learning* (2014)](https://arxiv.org/pdf/1405.4053v2.pdf)

See [Gensim Python Library](https://radimrehurek.com/gensim/models/doc2vec.html)

Training parameters:

- 85,410 Item Documents (containing 442,269 paragraphs and 80+ million words)
- Vector Size: 200 dimensions
- Number of Training Cycles: 100
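A minimal training sketch with the parameters listed above, assuming gensim 4.x and a corpus supplied as a dictionary mapping an identifier to text. The helpers `tokenize` and `train_doc2vec` and the input shape are assumptions for illustration; the actual preprocessing of the corpus is not described here.

```python
def tokenize(text):
    """Naive lowercase whitespace tokenizer (an assumption; real Latin text needs more)."""
    return text.lower().split()


def train_doc2vec(documents, vector_size=200, min_count=10, epochs=100):
    """Train a Doc2Vec model on `documents`, a dict mapping id -> text.

    gensim is imported here so the helper can be defined without it installed.
    """
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    corpus = [
        TaggedDocument(words=tokenize(text), tags=[doc_id])
        for doc_id, text in documents.items()
    ]
    model = Doc2Vec(vector_size=vector_size, min_count=min_count, epochs=epochs)
    model.build_vocab(corpus)
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
    return model
```

Once trained, a new passage can be embedded with `model.infer_vector(tokenize(text))` and compared against the corpus with `model.dv.most_similar([vector], topn=5)`. With `min_count=10`, words occurring fewer than ten times are dropped from the vocabulary, so a toy corpus would need `min_count=1` to train at all.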

### Example of Two High Embed Scores, Clarified by Convolution Cluster Score

### Example of High Embed Score Identifying Clusters Overlooked by Convolution at N-Gram Intersection Threshold of 6


## Questions