Intertextuality Analysis of Medieval Scholastic Texts using Ngrams and Graphs


Jeffrey C. Witt (Loyola University Maryland)
https://jeffreycwitt.com | jcwitt@loyola.edu
@jeffreycwitt


April 12, 2023, Institute for Medieval Research, Austrian Academy of Sciences

Slide Deck: http://jeffreycwitt.com/slides/2023-04-12-text-reuse

https://creativecommons.org/licenses/by-nc-sa/4.0/

## Outline

1. Similarity
2. Visualization
3. Cluster Detection
## Similarity

### Paragraph N-Gram Frequency Feature Vector

| Paragraph | 4gram1 | 4gram2 | 4gram3 | 4gram4 | 4gram5 | 4gram6 | 4gram7 | 4gram8 | 4gram9 | 4gram10 | 4gram11 | 4gram12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Doc A | 2 | 3 | 0 | 2 | 0 | 5 | 1 | 2 | 0 | 0 | 3 | 1 |

$$ \cos(\theta) = {\mathbf{A} \cdot \mathbf{B} \over \|\mathbf{A}\| \|\mathbf{B}\|} = \frac{ \sum\limits_{i=1}^{n}{A_i B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{A_i^2}} \sqrt{\sum\limits_{i=1}^{n}{B_i^2}} } $$
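For illustration, a minimal NumPy sketch of this cosine computation over the Doc A counts from the table above; the Doc B vector is invented for the example:

```python
import numpy as np

doc_a = np.array([2, 3, 0, 2, 0, 5, 1, 2, 0, 0, 3, 1])  # 4-gram counts from the table
doc_b = np.array([1, 2, 1, 0, 1, 4, 2, 1, 0, 1, 2, 2])  # hypothetical comparison vector

# cos(theta) = (A . B) / (||A|| ||B||)
cosine = doc_a @ doc_b / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))
print(round(float(cosine), 3))
```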

### Paragraph N-Gram Presence Feature Vector

| Paragraph | 4gram1 | 4gram2 | 4gram3 | 4gram4 | 4gram5 | 4gram6 | 4gram7 | 4gram8 | 4gram9 | 4gram10 | 4gram11 | 4gram12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Doc A | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 |

Element-wise multiplication of document vectors A and B yields the dot product vector, i.e. the intersection vector (V). Summing its entries counts the 4-grams the two documents share:

$$ \text{Intersection Vector Summation} = \sum\limits_{i=1}^{n}{A_i B_i} $$

If the Intersection Vector Summation >= 6, then document vectors A and B are "similar".
"Bag of N-Grams Model"

Paragraph

4gram1

4gram2

4gram3

4gram4

4gram5

4gram6

4gram7

4gram8

4gram9

4gram10

4gram11

4gram12

Doc A

1

1

0

1

0

1

1

1

0

0

1

1

Doc B

1

1

1

0

1

1

1

1

0

1

1

1

Dot Product Vector

1x1=1

1x1=1

0x1=0

1x0=0

0x1=0

1x1=1

0x0=1

1x1=1

0x0=0

0x1=0

1x1=1

1x1=1

![intersection](https://s3.amazonaws.com/lum-faculty-jcwitt-public/2023-02-01/image5.png)

Similarity: X is similar to Y if and only if

$$ \#\{\, ng \mid IsFoundIn(ng, X) \land IsFoundIn(ng, Y) \,\} \geq n $$

where X ≠ Y and n = 6.
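As a concrete sketch of this predicate: the intersection of two paragraphs' 4-gram sets equals the dot product of their presence vectors, so the test can be written directly over sets. A minimal Python sketch, assuming simple whitespace tokenization (the real pipeline's tokenization and normalization may differ):

```python
def four_grams(text: str) -> set[tuple[str, ...]]:
    """Return the set of word 4-grams in a paragraph (presence, not frequency)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + 4]) for i in range(len(tokens) - 3)}

def is_similar(x: str, y: str, n: int = 6) -> bool:
    """X is similar to Y iff the two paragraphs share at least n distinct 4-grams.
    The set intersection here equals the dot product of the presence vectors."""
    return len(four_grams(x) & four_grams(y)) >= n
```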
## Visualization
### SPARQL Query Graph ![](https://s3.amazonaws.com/lum-faculty-jcwitt-public/2023-04-12-vienna/sparqlQuery.png)
## Cluster Detection
Successive Passage Similarity: paragraphs x_n and y_m open a successive reuse when three consecutive paragraph pairs stand in the similarity relation R:

$$ SuccessiveReuse(x_n, y_m) \iff R(x_n, y_m) \land R(x_{n+1}, y_{m+1}) \land R(x_{n+2}, y_{m+2}) $$
| |5|6|7|8|9|10|11|12|
|---|---|---|---|---|---|---|---|---|
|2|0|0|0|0|0|0|0|0|
|3|0|0|**0**|**0**|**0**|**0**|0|0|
|4|0|0|**0**|**1**|**0**|**0**|0|0|
|5|0|0|**0**|**0**|**1**|**0**|0|0|
|6|0|0|**0**|**0**|**0**|**1**|0|0|
|7|0|0|0|0|0|0|0|0|
|8|0|0|0|0|0|0|0|0|
|9|0|0|0|0|0|0|0|0|

X

| | | | |
|---|---|---|---|
|1|0|0|0|
|0|1|0|0|
|0|0|1|0|
|0|0|0|1|

=

| | | | |
|---|---|---|---|
|0|0|0|0|
|0|1|0|0|
|0|0|1|0|
|0|0|0|1|

=

3

| |5|6|7|8|9|10|11|12|
|---|---|---|---|---|---|---|---|---|
|2|0|0|0|0|0|0|0|0|
|3|0|0|**1**|**0**|**0**|**0**|0|0|
|4|0|0|**0**|**0**|**0**|**0**|0|0|
|5|0|0|**0**|**0**|**1**|**0**|0|0|
|6|0|0|**0**|**1**|**0**|**1**|0|0|
|7|0|0|0|0|0|0|0|0|
|8|0|0|0|0|0|0|0|0|
|9|0|0|0|0|0|0|0|0|

X

| | | | |
|---|---|---|---|
|1|0|0|0|
|0|1|0|0|
|0|0|1|0|
|0|0|0|1|

=

| | | | |
|---|---|---|---|
|1|0|0|0|
|0|0|0|0|
|0|0|1|0|
|0|0|0|1|

=

3

SubstantialSuccessiveTextSimilarity(t) holds if and only if

$$ \#\{\, (x_n, y_m) \in t \mid SuccessiveReuse(x_n, y_m) \,\} \geq 10 $$

where t = Question or Chapter.
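The sliding-kernel scoring shown above can be sketched in a few lines of NumPy. This is a minimal illustration, not the talk's pipeline: the similarity matrix and identity kernel come from the first example above, while the function name and the final rollup step are hypothetical.

```python
import numpy as np

# Binary similarity matrix: rows = paragraphs 2-9 of text X,
# columns = paragraphs 5-12 of text Y (first example above).
sim = np.zeros((8, 8), dtype=int)
sim[2, 3] = sim[3, 4] = sim[4, 5] = 1  # a three-step diagonal run of matches

kernel = np.eye(4, dtype=int)  # identity kernel rewards diagonal (successive) matches

def diagonal_scores(matrix: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide the kernel over the matrix (cross-correlation, no kernel flip)
    and sum the element-wise products at each window position."""
    k = kernel.shape[0]
    rows, cols = matrix.shape
    out = np.zeros((rows - k + 1, cols - k + 1), dtype=int)
    for i in range(rows - k + 1):
        for j in range(cols - k + 1):
            out[i, j] = int((matrix[i:i + k, j:j + k] * kernel).sum())
    return out

scores = diagonal_scores(sim, kernel)
print(scores.max())  # 3: three successive paragraph pairs matched

# Hypothetical rollup for SubstantialSuccessiveTextSimilarity(t): a question
# or chapter counts as substantial reuse if it yields >= 10 successive-reuse hits.
successive_hits = int((scores >= 3).sum())
substantial = successive_hits >= 10
```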
### Doc Embedding
### Doc2Vec

`model = gensim.models.doc2vec.Doc2Vec(vector_size=200, min_count=10, epochs=100)`

See [Quoc Le and Tomas Mikolov, "Distributed Representations of Sentences and Documents", *Proceedings of the 31st International Conference on Machine Learning* (2014)](https://arxiv.org/pdf/1405.4053v2.pdf)

See the [Gensim Python Library](https://radimrehurek.com/gensim/models/doc2vec.html)

Training parameters:

- 85,410 item documents (containing 442,269 paragraphs and 80+ million words)
- Vector size: 200 dimensions
- Number of training cycles (epochs): 100
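A minimal end-to-end sketch of this Doc2Vec setup with Gensim. The two toy paragraphs are placeholders: the reported model was trained on 85,410 documents, which `min_count=10` presupposes, so `min_count` is lowered here purely to keep the toy example runnable.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Placeholder corpus of (id, text) pairs; the real corpus holds 85,410 documents.
paragraphs = [
    ("p1", "utrum deus sit unus et trinus"),
    ("p2", "videtur quod deus non sit unus"),
]
corpus = [TaggedDocument(words=text.split(), tags=[pid]) for pid, text in paragraphs]

# Reported parameters: vector_size=200, min_count=10, epochs=100.
# min_count=1 here only so the tiny placeholder corpus produces a vocabulary.
model = Doc2Vec(vector_size=200, min_count=1, epochs=100)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# Rank documents by embedding similarity to paragraph p1.
print(model.dv.most_similar("p1", topn=1))
```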
### Example of Two High Embed Scores, Clarified by Convolution Cluster Score
### Example of High Embed Score Identifying Clusters Overlooked by Convolution at N-Gram Intersection Threshold of 6
### Example of Two High Embed Scores, Clarified by Convolution Cluster Score
## Questions