Text Re-Use Detection with N-grams and Graphs.
On the Analysis of Intertextuality in the Texts and Commentaries of Medieval Scholasticism
Jeffrey C. Witt (Loyola University Maryland)
https://jeffreycwitt.com | jcwitt@loyola.edu
@jeffreycwitt
February 01, 2023
Digital History-Forschungskolloquium, Humboldt-Universität zu Berlin, Berlin
# 2. Aspiring to a Citation Network with the Traditional Apparatus Fontium
# 3. Detecting Similarity with N-grams
"Die Katze ist auf der Matte" ("The cat is on the mat") has three 4-grams:
1. Die Katze ist auf
2. Katze ist auf der
3. ist auf der Matte
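The list above can be reproduced with a few lines of Python. This is a minimal sketch that assumes whitespace tokenization; the function name `word_ngrams` is made up for illustration and is not taken from the talk.

```python
def word_ngrams(text: str, n: int = 4) -> list[str]:
    """Return all contiguous word n-grams of `text`, in reading order."""
    words = text.split()  # assumption: tokens are separated by whitespace
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


grams = word_ngrams("Die Katze ist auf der Matte", n=4)
for gram in grams:
    print(gram)
```

A sentence of k words yields k - n + 1 n-grams, so six words give three 4-grams.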
In description logic, we have the following: “Ngram.isFoundIn.Paragraph”
e.g.
sctar:videturquodnonsic sctap:isFoundIn sctar:para1, sctar:para5, sctar:para10, sctar:para21 .
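The "n-gram is found in paragraph" relation amounts to an inverted index from each n-gram to the paragraphs that contain it. The sketch below shows one way such an index could be built in memory. The function name `build_index`, the paragraph ids, and the Latin sample sentences are invented for illustration and are not data from the corpus.

```python
from collections import defaultdict


def build_index(paragraphs: dict[str, str], n: int = 4) -> dict[str, set[str]]:
    """Map every word n-gram to the set of paragraph ids it is found in."""
    index: dict[str, set[str]] = defaultdict(set)
    for para_id, text in paragraphs.items():
        words = text.lower().split()  # assumption: whitespace tokenization
        for i in range(len(words) - n + 1):
            index[" ".join(words[i:i + n])].add(para_id)
    return dict(index)


# Hypothetical toy paragraphs, not real corpus text.
paragraphs = {
    "para1": "videtur quod non sic quia deus est",
    "para5": "sed contra videtur quod non sic",
    "para10": "respondeo dicendum quod sic",
}
index = build_index(paragraphs)
print(sorted(index["videtur quod non sic"]))
```

Each index entry corresponds to one subject with several `isFoundIn` objects, as in the statement above.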
![intersection](https://s3.amazonaws.com/lum-faculty-jcwitt-public/2023-02-01/image5.png)
X is related to Y if and only if X and Y are distinct and share at least six n-grams:
$$ \\#\\{ n \mid IsFoundIn(n,X) \land IsFoundIn(n,Y) \\} \geq 6, \quad X \neq Y $$
Document Vectors A * B (element-wise product) = DotProductVector = Intersection Vector (V)
$$ \text{Intersection Vector Summation} = \sum_{i=1}^{n} v_i $$
If Intersection Vector Summation >= 6, then Doc Vectors A and B are "similar"
"Bag of N-Grams Model"
| Paragraph | 4gram1 | 4gram2 | 4gram3 | 4gram4 | 4gram5 | 4gram6 | 4gram7 | 4gram8 | 4gram9 | 4gram10 | 4gram11 | 4gram12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Doc A | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 |
| Doc B | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| Dot Product Vector | 1x1=1 | 1x1=1 | 0x1=0 | 1x0=0 | 0x1=0 | 1x1=1 | 1x1=1 | 1x1=1 | 0x0=0 | 0x1=0 | 1x1=1 | 1x1=1 |
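The arithmetic in the table can be checked directly. This sketch copies the Doc A and Doc B rows as binary vectors, multiplies them element by element, and sums the result. The variable names and the `THRESHOLD` constant are illustrative choices; the threshold value 6 comes from the rule stated above.

```python
# Binary "bag of n-grams" vectors copied from the Doc A and Doc B rows.
doc_a = [1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1]
doc_b = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1]
THRESHOLD = 6  # minimum number of shared 4-grams

# The element-wise product is 1 only where both documents contain the n-gram.
intersection = [a * b for a, b in zip(doc_a, doc_b)]
shared = sum(intersection)  # this sum equals the dot product of the two vectors
similar = shared >= THRESHOLD

print(intersection)
print(shared, similar)
```

Here the two documents share seven 4-grams, so they count as "similar" under the threshold of six.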
```
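# Count the n-grams shared by each ordered pair of distinct paragraphs.
# The sctap: prefix must be declared for the endpoint being queried.
SELECT (COUNT(*) AS ?count) ?start ?target
WHERE {
  ?ngram sctap:isFoundIn ?start .
  ?ngram sctap:isFoundIn ?target .
  FILTER(?start != ?target)
}
GROUP BY ?start ?target
HAVING (COUNT(*) >= 6)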
```
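For readers without a triple store at hand, the same grouping and filtering can be sketched in plain Python over an in-memory index. This is an illustrative equivalent, not the implementation used in the talk; the function name `related_pairs` and the toy index (n-grams `g1` to `g7`, paragraphs `p1` to `p3`) are hypothetical.

```python
from collections import Counter
from itertools import permutations


def related_pairs(index: dict[str, set[str]], threshold: int = 6) -> dict[tuple[str, str], int]:
    """Count shared n-grams per ordered paragraph pair; keep pairs at or above threshold."""
    counts: Counter = Counter()
    for paragraphs in index.values():
        # Every ordered pair of distinct paragraphs sharing this n-gram gets one count.
        for start, target in permutations(sorted(paragraphs), 2):
            counts[(start, target)] += 1
    return {pair: c for pair, c in counts.items() if c >= threshold}


# Hypothetical toy index: p1 and p2 share seven n-grams, p3 shares only three.
index = {f"g{i}": {"p1", "p2"} for i in range(1, 8)}
for i in range(1, 4):
    index[f"g{i}"].add("p3")

related = related_pairs(index)
print(related)
```

Like the query, this reports each related pair twice, once in each direction.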
# 4. N-gram Visualization: First Attempt
# 5. N-gram Visualization: Second Attempt
## 5.1. Example 1: Discovering Citation Patterns in the Sentences Commentary Tradition
## 5.2. Example 2: "Uncited Successive Passage Re-Use"
“Petrus Gracilis...followed not only the footsteps but the very phrases of Hiltalingen in a way so deceptive that it does not cast the best light on Gracilis. He read secundum Hiltalingen without ever mentioning him. Only by a **lucky coincidence** [emphasis mine] was I enabled to "unmask" Gracilis' dubious literary honesty.” (See Trapp, Damasus, "Augustinian Theology of the 14th Century," Augustiniana 6 (1956): 147-274, p. 254.)
||5|6|7|8|9|10|11|12|
|---|---|---|---|---|---|---|---|---|
|2|||||||||
|3|||||||||
|4||||x|||||
|5|||||x||||
|6||||||x|||
|7|||||||||
|8|||||||||
|9|||||||||
$$ SuccessiveReuse(t) = $$
$$ \exists{x_n}\exists{y_m}(R(x_{n}, y_{m}) \land R(x_{n+1}, y_{m+1}) \land R(x_{n+2},y_{m+2})) $$
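The strict version of this condition can be sketched as a search for a diagonal run in the relation R. The sketch assumes the intended reading is that at least one related pair starts a run of three consecutive related pairs, and it represents R as a set of (n, m) paragraph positions. The function name `has_successive_reuse` and this encoding are illustrative assumptions.

```python
def has_successive_reuse(related: set[tuple[int, int]], run: int = 3) -> bool:
    """True if some pair (n, m) starts `run` consecutive related pairs on a diagonal."""
    return any(
        all((n + k, m + k) in related for k in range(run))
        for n, m in related
    )


# The three marked cells from the grid above: (4, 8), (5, 9), (6, 10).
print(has_successive_reuse({(4, 8), (5, 9), (6, 10)}))   # unbroken diagonal
print(has_successive_reuse({(4, 8), (5, 9), (7, 11)}))   # gap at (6, 10)
```

The first call finds the unbroken diagonal; the second does not, because one step of the run is missing.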
$$ SuccessiveReuse(t) = $$
$$ \exists{x_n}\exists{y_m}(R(x_n,y_m) $$
$$ \land (R(x_{n+1},y_{m+1}) \lor R(x_{n+2},y_{m+2})) $$
$$ \land (R(x_{n+3},y_{m+2}) \lor R(x_{n+4},y_{m+3})) $$
$$ \land (R(x_{n+3},y_{m+3}) \lor R(x_{n+4},y_{m+4}))) $$
SubstantialSuccessiveReuse(t) =
$$ \\#\\{ t \mid SuccessiveReuse(t) \\} \geq 10 $$
where t = Question or Chapter
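One possible reading of this last condition is that two works show substantial successive reuse when at least ten of their questions or chapters each contain a diagonal run. The sketch below follows that reading, which is an assumption, and uses the strict three-step run for the per-unit check. All names and the toy data are hypothetical.

```python
Pair = tuple[int, int]


def successive_reuse(related: set[Pair], run: int = 3) -> bool:
    """Strict check: some pair starts `run` consecutive related pairs."""
    return any(all((n + k, m + k) in related for k in range(run)) for n, m in related)


def substantial_successive_reuse(units: dict[str, set[Pair]], min_units: int = 10) -> bool:
    """True if at least `min_units` questions or chapters show successive reuse."""
    hits = sum(1 for related in units.values() if successive_reuse(related))
    return hits >= min_units


# Hypothetical data: ten questions with a diagonal run, one without.
diagonal = {(4, 8), (5, 9), (6, 10)}
units = {f"question{i}": set(diagonal) for i in range(1, 11)}
units["question11"] = {(1, 1)}

print(substantial_successive_reuse(units))
```

Lowering or raising `min_units` changes how much evidence is required before a dependence between two commentaries is flagged.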