Text Re-Use Detection with N-grams and Graphs

On the analysis of intertextuality in the texts and commentaries of medieval scholasticism


Jeffrey C. Witt (Loyola University Maryland)
https://jeffreycwitt.com | jcwitt@loyola.edu
@jeffreycwitt


February 01, 2023

Digital History-Forschungskolloquium, Humboldt-Universität zu Berlin, Berlin

Slide Deck: http://jeffreycwitt.com/slides/2023-02-01-ngrams

https://creativecommons.org/licenses/by-nc-sa/4.0/

# 1. Introduction
# 2. Aspiring to a Citation Network with the Traditional Apparatus Fontium
# 3. Detecting Similarity with N-grams
"Die Katze ist auf der Matte" hat drei 4-grams 1. Die Katze ist auf 2. Katze ist auf der 3. ist auf der Matte
In description logic, we have the following: “Ngram.isFoundIn.Paragraph”

e.g. sctar:videturquodnonsic sctap:isFoundIn sctar:para1; sctar:para5; sctar:para10; sctar:para21
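In memory, the same "n-gram is found in paragraph" statements amount to an inverted index from each n-gram to the paragraphs that contain it. The sketch below is only an illustration: `build_index` is a hypothetical helper, and the paragraph ids and Latin snippets are invented toy data, not taken from the corpus.

```python
from collections import defaultdict


def build_index(paragraphs, n=4):
    """Map each n-gram to the set of paragraph ids in which it is found."""
    index = defaultdict(set)
    for para_id, text in paragraphs.items():
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            index[" ".join(tokens[i:i + n])].add(para_id)
    return index


# Invented toy paragraphs (assumption, for illustration only).
paragraphs = {
    "para1": "videtur quod non sic quia deus est",
    "para5": "et videtur quod non sic per rationem",
    "para9": "contra hoc arguitur sic",
}
index = build_index(paragraphs)
print(sorted(index["videtur quod non sic"]))
# ['para1', 'para5']
```

Each key plays the role of the n-gram resource and each set member the role of an `isFoundIn` object.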
![intersection](https://s3.amazonaws.com/lum-faculty-jcwitt-public/2023-02-01/image5.png)

Paragraph x is related to paragraph y, written R(x, y), if and only if

$$ \\#\\{ n \mid IsFoundIn(n,x) \land IsFoundIn(n,y) \land x \neq y \\} \geq 6 $$
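For a single pair of paragraphs, this condition can be checked directly with a set intersection. The following is a hedged sketch: `ngram_set` and `is_related` are hypothetical names, the placeholder words are invented, and the caller is assumed to pass two distinct paragraphs.

```python
def ngram_set(text, n=4):
    """Return the set of distinct word-level n-grams in `text`."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_related(text_x, text_y, n=4, threshold=6):
    """True if the two paragraphs share at least `threshold` distinct n-grams."""
    shared = ngram_set(text_x, n) & ngram_set(text_y, n)
    return len(shared) >= threshold


# A shared run of nine words contains six 4-grams (9 - 4 + 1 = 6).
shared_run = "one two three four five six seven eight nine"
x = "alpha " + shared_run
y = shared_run + " omega"
print(is_related(x, y))
# True
```

With a threshold of six 4-grams, the shortest verbatim overlap that counts as a relation is a run of nine consecutive words.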
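Multiplying document vectors A and B element by element gives the intersection vector V (the "Dot Product Vector"):

$$ v_i = a_i \cdot b_i $$

$$ \text{Intersection Vector Summation} = \sum_{i=1}^{n} v_i $$

This sum is the dot product of A and B. If Intersection Vector Summation >= 6, then document vectors A and B are "similar".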
"Bag of N-Grams Model"

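| Paragraph | 4gram1 | 4gram2 | 4gram3 | 4gram4 | 4gram5 | 4gram6 | 4gram7 | 4gram8 | 4gram9 | 4gram10 | 4gram11 | 4gram12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Doc A | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 |
| Doc B | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| Dot Product Vector | 1x1=1 | 1x1=1 | 0x1=0 | 1x0=0 | 0x1=0 | 1x1=1 | 1x1=1 | 1x1=1 | 0x0=0 | 0x1=0 | 1x1=1 | 1x1=1 |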
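The Doc A and Doc B rows can be checked with a few lines of Python. This sketch only reproduces the arithmetic of the table; the variable names are illustrative.

```python
# Binary "bag of n-grams" vectors: 1 if the paragraph contains the 4-gram.
doc_a = [1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1]
doc_b = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1]

# Element-wise product: 1 only where both paragraphs contain the 4-gram.
intersection = [a * b for a, b in zip(doc_a, doc_b)]

# Summing the products gives the dot product, i.e. the number of shared 4-grams.
score = sum(intersection)

print(intersection)  # [1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1]
print(score)         # 7
print(score >= 6)    # True
```

With seven shared 4-grams, Doc A and Doc B clear the threshold of six and count as "similar".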

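```
# Assumes the sctap: prefix is declared for the SCTA property namespace.
# The isFoundIn predicate is restored from the statement pattern shown earlier.
SELECT (COUNT(*) AS ?count) ?start ?target
WHERE {
  # Self-join: the same n-gram is found in two different paragraphs.
  ?ngram sctap:isFoundIn ?start .
  ?ngram sctap:isFoundIn ?target .
  FILTER(?start != ?target)
}
GROUP BY ?start ?target
# Keep only paragraph pairs that share at least six n-grams.
HAVING (COUNT(*) >= 6)
```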
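For readers less familiar with SPARQL, the query's self-join, grouping and threshold can be mirrored in plain Python. This is a sketch under assumptions: `related_pairs` is a hypothetical helper, and the input is a list of (n-gram, paragraph) pairs standing in for the `isFoundIn` statements.

```python
from collections import Counter, defaultdict


def related_pairs(statements, threshold=6):
    """Count shared n-grams per ordered paragraph pair and keep pairs >= threshold."""
    # Group paragraphs by n-gram (the join variable ?ngram).
    by_ngram = defaultdict(set)
    for ngram, paragraph in statements:
        by_ngram[ngram].add(paragraph)

    # For every n-gram, count each ordered pair (start, target) with start != target.
    counts = Counter()
    for paragraphs in by_ngram.values():
        for start in paragraphs:
            for target in paragraphs:
                if start != target:
                    counts[(start, target)] += 1

    # HAVING: keep only pairs that share at least `threshold` n-grams.
    return {pair: c for pair, c in counts.items() if c >= threshold}


# Invented toy data: p1 and p2 share six n-grams, p3 shares only one with them.
statements = [(f"g{i}", "p1") for i in range(6)]
statements += [(f"g{i}", "p2") for i in range(6)]
statements += [("g0", "p3")]
print(related_pairs(statements))
```

As in the query, every related pair appears twice, once in each direction, because `start` and `target` are interchangeable.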
# 4. N-gram Visualization: First Attempt
# 5. N-gram Visualization: Second Attempt
## 5.1. Example 1: Discovering Citation Patterns in the Sentences Commentary Tradition
## 5.2. Example 2: "Uncited Successive Passage Re-Use"
“Petrus Gracilis...followed not only the footsteps but the very phrases of Hiltalingen in a way so deceptive that it does not cast the best light on Gracilis. He read secundum Hiltalingen without ever mentioning him. Only by a **lucky coincidence** [emphasis mine] was I enabled to "unmask" Gracilis' dubious literary honesty.” (See Trapp, Damasus, "Augustinian Theology of the 14th Century," Augustiniana 6 (1956): 147-274, p. 254.)
||5|6|7|8|9|10|11|12|
|---|---|---|---|---|---|---|---|---|
|2|||||||||
|3|||||||||
|4||||x|||||
|5|||||x||||
|6||||||x|||
|7|||||||||
|8|||||||||
|9|||||||||
$$ SuccessiveReuse(t) = $$
$$ \exists{x_n}\exists{y_m}(R(x_{n}, y_{m}) \land R(x_{n+1}, y_{m+1}) \land R(x_{n+2},y_{m+2})) $$

A looser variant that tolerates skipped paragraphs:

$$ SuccessiveReuse(t) = $$
$$ \exists{x_n}\exists{y_m}(R(x_n,y_m) $$
$$ \land (R(x_{n+1},y_{m+1}) \lor R(x_{n+2},y_{m+2})) $$
$$ \land (R(x_{n+3},y_{m+2}) \lor R(x_{n+4},y_{m+3})) $$
$$ \land (R(x_{n+3},y_{m+3}) \lor R(x_{n+4},y_{m+4}))) $$
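Both definitions can be sketched as checks over a set of related paragraph pairs. The sketch rests on several assumptions: the relation R is given as a set of index pairs (n, m), the definitions are read as asking whether some starting pair exists, the grid's marked cells are read as (row, column) pairs, and the second function is a literal transcription of the disjunctions as written. The function names are hypothetical.

```python
def strict_successive_reuse(related):
    """True if some related pair (n, m) is followed by (n+1, m+1) and (n+2, m+2)."""
    return any(
        (n + 1, m + 1) in related and (n + 2, m + 2) in related
        for (n, m) in related
    )


def relaxed_successive_reuse(related):
    """Literal transcription of the looser definition, with its index offsets as given."""
    def r(i, j):
        return (i, j) in related

    # Iterating over `related` guarantees that R(x_n, y_m) holds for the start pair.
    return any(
        (r(n + 1, m + 1) or r(n + 2, m + 2))
        and (r(n + 3, m + 2) or r(n + 4, m + 3))
        and (r(n + 3, m + 3) or r(n + 4, m + 4))
        for (n, m) in related
    )


# The three marked cells of the grid, read as (row, column) pairs.
grid = {(4, 8), (5, 9), (6, 10)}
print(strict_successive_reuse(grid))
# True

# Invented example that satisfies the looser definition.
print(relaxed_successive_reuse({(1, 1), (2, 2), (4, 3), (4, 4)}))
# True
```

In the grid, the diagonal run from (4, 8) to (6, 10) is exactly the pattern the strict definition looks for.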
$$ SubstantialSuccessiveReuse = \\#\\{ t \mid SuccessiveReuse(t) \\} \geq 10 $$

where t = Question or Chapter
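One possible reading of this threshold, sketched in Python: count the questions or chapters t in which successive reuse occurs and compare that count with 10. The representation of each unit as a set of related paragraph index pairs, the strict three-step run, and the function names are all assumptions made for illustration.

```python
def has_run(related, run=3):
    """True if `related` contains a diagonal run of `run` consecutive pairs."""
    return any(
        all((n + k, m + k) in related for k in range(run))
        for (n, m) in related
    )


def substantial_successive_reuse(units, threshold=10):
    """Return (verdict, hits): whether at least `threshold` units show successive reuse."""
    hits = [name for name, related in units.items() if has_run(related)]
    return len(hits) >= threshold, hits


# Invented toy data: ten questions with a three-paragraph run, one without.
units = {f"q{i}": {(1, 1), (2, 2), (3, 3)} for i in range(10)}
units["q10"] = {(1, 1), (5, 9)}

verdict, hits = substantial_successive_reuse(units)
print(verdict, len(hits))
# True 10
```

Returning the list of matching units alongside the verdict makes it easy to inspect which questions or chapters drive the result.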
# Questions and Discussion