Intertextuality Analysis of Medieval Scholastic Texts using Ngrams and Graphs
Jeffrey C. Witt (Loyola University Maryland)
https://jeffreycwitt.com | jcwitt@loyola.edu
@jeffreycwitt
April 12, 2023, Institute for Medieval Research, Austrian Academy of Sciences
Outline
- Similarity
- Visualization
- Cluster Detection
Paragraph N-Gram Frequency Feature Vector
Paragraph | 4gram1 | 4gram2 | 4gram3 | 4gram4 | 4gram5 | 4gram6 | 4gram7 | 4gram8 | 4gram9 | 4gram10 | 4gram11 | 4gram12 |
Doc A | 2 | 3 | 0 | 2 | 0 | 5 | 1 | 2 | 0 | 0 | 3 | 1 |
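A row like the one above could be built roughly as follows. This is a minimal sketch, assuming word-level 4-grams and plain whitespace tokenization; the slides do not specify tokenization or normalization, and `word_ngrams`, `frequency_vector`, and the sample paragraph are hypothetical.

```python
from collections import Counter

def word_ngrams(text, n=4):
    """Return the word-level n-grams of a text as tuples (whitespace tokenization assumed)."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def frequency_vector(text, vocabulary, n=4):
    """Count how often each n-gram of the vocabulary occurs in the text."""
    counts = Counter(word_ngrams(text, n))
    return [counts[gram] for gram in vocabulary]

# Hypothetical sample paragraph, not taken from the corpus.
paragraph = "utrum deus sit unus utrum deus sit unus"

# The vocabulary fixes the column order of the feature vector.
vocabulary = sorted(set(word_ngrams(paragraph)))
vector = frequency_vector(paragraph, vocabulary)
```

In practice the vocabulary would be collected over every paragraph in the corpus, so that all document vectors share the same columns.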
\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}
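The cosine formula can be sketched in a few lines of Python. The Doc A vector is the frequency row from the slide; the slide gives no Doc B frequency row, so `doc_b` is a hypothetical vector used only for illustration, and `cosine_similarity` is an illustrative name.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_a * norm_b)

doc_a = [2, 3, 0, 2, 0, 5, 1, 2, 0, 0, 3, 1]  # frequency row from the slide
doc_b = [1, 2, 1, 0, 1, 4, 1, 2, 0, 1, 2, 1]  # hypothetical, for illustration only

score = cosine_similarity(doc_a, doc_b)
```

A score near 1 means the two paragraphs use the same 4-grams in similar proportions; a score of 0 means they share none.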
Paragraph N-Gram Presence Feature Vector
Paragraph | 4gram1 | 4gram2 | 4gram3 | 4gram4 | 4gram5 | 4gram6 | 4gram7 | 4gram8 | 4gram9 | 4gram10 | 4gram11 | 4gram12 |
Doc A | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 |
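The presence vector is the frequency vector reduced to 0 or 1 per n-gram. A minimal sketch, using the Doc A frequency row from the earlier slide; `to_presence` is a hypothetical helper name.

```python
def to_presence(frequency_vector):
    """Map each count to 1 if the n-gram occurs at all, else 0."""
    return [1 if count > 0 else 0 for count in frequency_vector]

frequency_a = [2, 3, 0, 2, 0, 5, 1, 2, 0, 0, 3, 1]  # Doc A frequency row from the slide
presence_a = to_presence(frequency_a)
```

Dropping the counts means a 4-gram repeated many times inside one paragraph weighs no more than one that occurs once.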
Document Vectors A * B (element-wise product) = Dot Product Vector = Intersection Vector (V)
IntersectionVectorSummation = \sum_{i=1}^{n} A_i B_i
If Intersection Vector Summation >= 6, then Doc Vectors A and B are "similar"
"Bag of N-Grams Model"
Paragraph | 4gram1 | 4gram2 | 4gram3 | 4gram4 | 4gram5 | 4gram6 | 4gram7 | 4gram8 | 4gram9 | 4gram10 | 4gram11 | 4gram12 |
Doc A | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 |
Doc B | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
Dot Product Vector | 1x1=1 | 1x1=1 | 0x1=0 | 1x0=0 | 0x1=0 | 1x1=1 | 1x1=1 | 1x1=1 | 0x0=0 | 0x1=0 | 1x1=1 | 1x1=1 |
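The table can be reproduced with a short Python sketch. The two presence vectors are the Doc A and Doc B rows above and the threshold of 6 is the one stated on the slide; the function names are illustrative.

```python
def intersection_vector(a, b):
    """Element-wise product of two presence vectors: 1 where both contain the n-gram."""
    return [x * y for x, y in zip(a, b)]

def is_similar(a, b, threshold=6):
    """Two paragraphs count as similar when they share at least `threshold` n-grams."""
    return sum(intersection_vector(a, b)) >= threshold

doc_a = [1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1]  # Doc A row from the table
doc_b = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1]  # Doc B row from the table

v = intersection_vector(doc_a, doc_b)
total = sum(v)
```

For these two rows the intersection vector sums to 7, so Doc A and Doc B pass the threshold of 6.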
\#\{ a | \forall{ng}\forall{x}\forall{y}(IsFoundIn(ng,x) \land IsFoundIn(ng,y) \land x \neq y) \} \geq n
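One possible reading of this condition is that two distinct paragraphs x and y are related when the number of n-grams found in both reaches n. Under that assumption, the count can be sketched in Python with an inverted index from n-gram to paragraphs. The data, the identifiers `ng1` and `p1`, and the function names are all hypothetical; this is not the query used in the talk.

```python
from collections import defaultdict
from itertools import combinations

def shared_ngram_counts(found_in):
    """Count, for every pair of distinct paragraphs, the n-grams found in both.

    `found_in` maps an n-gram to the set of paragraph ids that contain it.
    """
    counts = defaultdict(int)
    for paragraphs in found_in.values():
        for x, y in combinations(sorted(paragraphs), 2):
            counts[(x, y)] += 1
    return counts

def similar_pairs(found_in, n):
    """Pairs of paragraphs that share at least n n-grams."""
    return {pair for pair, count in shared_ngram_counts(found_in).items() if count >= n}

# Hypothetical index, for illustration only.
found_in = {
    "ng1": {"p1", "p2"},
    "ng2": {"p1", "p2", "p3"},
    "ng3": {"p2", "p3"},
    "ng4": {"p1"},
}
pairs = similar_pairs(found_in, n=2)
```

Walking the index only touches pairs that share at least one n-gram, which avoids comparing every paragraph with every other paragraph.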
SPARQL Query Graph

Successive Passage Similarity =
\forall x_n \forall y_m (R(x_n, y_m) \land R(x_{n+1}, y_{m+1}) \land R(x_{n+2}, y_{m+2}))
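A minimal sketch of this check, assuming R is stored as a set of (n, m) index pairs meaning that paragraph n of one text is similar to paragraph m of another. The sample relation and the name `successive_runs` are hypothetical.

```python
def successive_runs(related, length=3):
    """Return the starting pairs (n, m) where R holds for `length` successive paragraph pairs."""
    return sorted(
        (n, m)
        for (n, m) in related
        if all((n + k, m + k) in related for k in range(length))
    )

# Hypothetical relation R, for illustration only.
related = {(4, 8), (5, 9), (6, 10), (3, 12)}
runs = successive_runs(related)
```

Here only the run starting at (4, 8) qualifies, because (5, 9) and (6, 10) continue it; the isolated pair (3, 12) does not.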
  | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
5 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
6 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
X
1 | 0 | 0 | 0 |
0 | 1 | 0 | 0 |
0 | 0 | 1 | 0 |
0 | 0 | 0 | 1 |
=
0 | 0 | 0 | 0 |
0 | 1 | 0 | 0 |
0 | 0 | 1 | 0 |
0 | 0 | 0 | 1 |
= 3
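The calculation on this slide can be sketched in Python as an element-wise product of a 4x4 window with the diagonal kernel, followed by a sum. The matrix values are those of the slide. The 4x4 product shown matches the window at rows 3 to 6 and columns 7 to 10. Reading the row and column labels as paragraph positions in two texts is an assumption, and `window_score` is a hypothetical name.

```python
def window_score(matrix, kernel, top, left):
    """Sum of the element-wise product of `kernel` with the window of `matrix` at (top, left)."""
    size = len(kernel)
    return sum(
        matrix[top + i][left + j] * kernel[i][j]
        for i in range(size)
        for j in range(size)
    )

# Similarity matrix from the slide: rows are labelled 2 to 9, columns 5 to 12.
matrix = [
    [0, 0, 0, 0, 0, 0, 0, 0],  # row 2
    [0, 0, 0, 0, 0, 0, 0, 0],  # row 3
    [0, 0, 0, 1, 0, 0, 0, 0],  # row 4
    [0, 0, 0, 0, 1, 0, 0, 0],  # row 5
    [0, 0, 0, 0, 0, 1, 0, 0],  # row 6
    [0, 0, 0, 0, 0, 0, 0, 0],  # row 7
    [0, 0, 0, 0, 0, 0, 0, 0],  # row 8
    [0, 0, 0, 0, 0, 0, 0, 0],  # row 9
]

# 4x4 diagonal kernel from the slide.
kernel = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]

# Rows 3 to 6 and columns 7 to 10 correspond to list offsets top=1, left=2.
score = window_score(matrix, kernel, top=1, left=2)
```

The kernel rewards matches that lie along a diagonal, which is the pattern produced when successive paragraphs of one text are reused in the same order in another.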
  | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
6 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
X
1 | 0 | 0 | 0 |
0 | 1 | 0 | 0 |
0 | 0 | 1 | 0 |
0 | 0 | 0 | 1 |
=
1 | 0 | 0 | 0 |
0 | 0 | 0 | 0 |
0 | 0 | 1 | 0 |
0 | 0 | 0 | 1 |
= 3
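Sliding the kernel over every position of the matrix gives a score for each window. The sketch below does this for the second matrix, where row 4 has no match and row 6 has an off-diagonal match at column 8, yet the window at rows 3 to 6 and columns 7 to 10 still scores 3. It assumes a plain "valid" sliding window with no padding; the slides do not say how the edges are handled, and `convolve_valid` is a hypothetical name.

```python
def convolve_valid(matrix, kernel):
    """Score every window position where the kernel fits entirely inside the matrix."""
    size = len(kernel)
    rows = len(matrix) - size + 1
    cols = len(matrix[0]) - size + 1
    return [
        [
            sum(
                matrix[r + i][c + j] * kernel[i][j]
                for i in range(size)
                for j in range(size)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]

# Second similarity matrix from the slide: rows are labelled 2 to 9, columns 5 to 12.
matrix2 = [
    [0, 0, 0, 0, 0, 0, 0, 0],  # row 2
    [0, 0, 1, 0, 0, 0, 0, 0],  # row 3
    [0, 0, 0, 0, 0, 0, 0, 0],  # row 4
    [0, 0, 0, 0, 1, 0, 0, 0],  # row 5
    [0, 0, 0, 1, 0, 1, 0, 0],  # row 6
    [0, 0, 0, 0, 0, 0, 0, 0],  # row 7
    [0, 0, 0, 0, 0, 0, 0, 0],  # row 8
    [0, 0, 0, 0, 0, 0, 0, 0],  # row 9
]

# 4x4 diagonal kernel from the slide.
kernel = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]

scores = convolve_valid(matrix2, kernel)
best = max(max(row) for row in scores)
```

Because the score is a sum over the whole window, a gap in the run of matches lowers the score but does not reset it, and matches off the diagonal are ignored.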
\#\{ a | \forall{t}(SuccessiveReuse(t)) \} \geq 10
Example of Two High Embed Scores,
Clarified by Convolution Cluster Score
Example of High Embed Score Identifying Clusters
Overlooked by Convolution at N-Gram Intersection Threshold of 6
Example of Two High Embed Scores,
Clarified by Convolution Cluster Score
Slide Deck: http://jeffreycwitt.com/slides/2023-04-12-text-reuse
https://creativecommons.org/licenses/by-nc-sa/4.0/