Text Reuse Detection with N-Grams and Graphs
in 5 minutes


Jeffrey C. Witt (Loyola University Maryland)
https://jeffreycwitt.com | jcwitt@loyola.edu
@jeffreycwitt


June 06, 2023, DHSI, University of Victoria, Victoria, BC, Canada

Slide Deck: http://jeffreycwitt.com/slides/2023-06-06-dhsi-text-reuse-detection

https://creativecommons.org/licenses/by-nc-sa/4.0/

“Petrus Gracilis...followed not only the footsteps but the very phrases of Hiltalingen in a way so deceptive that it does not cast the best light on Gracilis. He read secundum Hiltalingen without ever mentioning him. Only by a **lucky coincidence** [emphasis mine] was I enabled to "unmask" Gracilis' dubious literary honesty. (See Trapp, Damasus, "Augustinian Theology of the 14th Century," Augustiniana 6 (1956): 147-274, p. 254.)
"The cat is on the mat" --- 4-grams --- "the cat is on" "cat is on the" "is on the mat"

Paragraph (Document) Feature Vector
as
Vector recording ngram presence (1) or absence (0)

Paragraph

4gram1

4gram2

4gram3

4gram4

4gram5

4gram6

4gram7

4gram8

4gram9

4gram10

4gram11

4gram12

Doc A

1

1

0

1

0

1

1

1

0

0

1

1

Intersection of shared ngrams in Docs A and B
as
Dot Product of Document Feature Vectors $$ A \cdot B = \sum\limits_{i=1}^{n}{A_i B_i} $$ If Dot Product of A and B >= 6, then Doc Vectors A and B are "similar"

Paragraph

4gram1

4gram2

4gram3

4gram4

4gram5

4gram6

4gram7

4gram8

4gram9

4gram10

4gram11

4gram12

sum

Doc A

1

1

0

1

0

1

1

1

0

0

1

1

8

Doc B

1

1

1

0

1

1

1

1

0

1

1

1

10

A * B (Dot Product Vector)

1x1=1

1x1=1

0x1=0

1x0=0

0x1=0

1x1=1

0x0=0

1x1=1

0x0=0

0x1=0

1x1=1

1x1=1

6

![intersection](https://s3.amazonaws.com/lum-faculty-jcwitt-public/2023-02-01/image5.png) Similarity = X is similar to Y, if and only if $$ \\#\\{ a | \forall{ng}\forall{x}\forall{y}(IsFoundIn(ng,x) \land IsFoundIn(ng,y) \land x \neq y \\} >= n $$ where n = 6

SuccessiveReuse = Windows with Convolution Score >= 3

||5|6|7|8|9|10|11|12| |---|---|---|---|---|---|---|---|---|---| |2|0|0|0|0|0|0|0|0| |3|0|0|**1**|**0**|**0**|**0**|0|0| |4|0|0|**0**|**0**|**0**|**0**|0|0| |5|0|0|**0**|**0**|**1**|**0**|0|0| |6|0|0|**0**|**1**|**0**|**1**|0|0| |7|0|0|0|0|0|0|0|0| |8|0|0|0|0|0|0|0|0| |9|0|0|0|0|0|0|0|0|

X

||||| |---|---|---| |1|0|0|0| |0|1|0|0| |0|0|1|0| |0|0|0|1|

=

||||| |---|---|---| |1|0|0|0| |0|0|0|0| |0|0|1|0| |0|0|0|1|

=

3

## Questions/Discussion