Languedoc
Language identification library based on "N-Gram-Based Text Categorization" by Cavnar and Trenkle.
Usage
1. Create a directory named data, with a subdirectory for each target language. Fill it with your training data.
2. Run the training script:
       PYTHONPATH=src/ python3 train.py
   It will create a models.json.gz file.
3. Build and install the package:
       python3 -m build
       pip install dist/languedoc-...-py3-none-any.whl
4. You can now use:
       import languedoc
       language = languedoc.identify("A text you want to identify.")
It will output the identifier that you used as the subdirectory name in step 1, based on the closest match between n-gram frequencies.
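To illustrate what "closest match between n-gram frequencies" means, here is a minimal sketch of the rank-order comparison described by Cavnar and Trenkle. It is not the languedoc implementation: the function names `ngram_profile`, `out_of_place` and `closest_language`, the n-gram sizes and the profile length are all assumptions chosen for illustration.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top=300):
    """Return the character n-grams of `text`, most frequent first.

    max_n and top are illustrative defaults, not values taken from languedoc.
    """
    text = " ".join(text.lower().split())  # normalise case and whitespace
    counts = Counter(
        text[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(text) - n + 1)
    )
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(sample_profile, model_profile):
    """Sum of rank differences between two profiles.

    N-grams missing from the model receive the maximum penalty.
    """
    model_rank = {gram: rank for rank, gram in enumerate(model_profile)}
    max_penalty = len(model_profile)
    return sum(
        abs(rank - model_rank[gram]) if gram in model_rank else max_penalty
        for rank, gram in enumerate(sample_profile)
    )


def closest_language(text, models):
    """Return the key of the model profile closest to the profile of `text`."""
    sample = ngram_profile(text)
    return min(models, key=lambda lang: out_of_place(sample, models[lang]))
```

Here `models` is assumed to be a plain dictionary mapping a language identifier (the subdirectory name) to a profile built with `ngram_profile` from that language's training text. Comparing ranks rather than raw counts keeps the measure independent of text length, which is why it can work on short samples at all.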
Accuracy
Below is the training script output for my training data, covering seven major European languages. The cross-validation iterates through all languages; for each language it creates five models, and for each model it runs ten tests at each sample length of 8, 16, 32 and 64 characters. That gives 7 × 5 × 10 × 4 = 1400 tests in total, or 350 per sample length. Counting the misidentified samples shows that the tiny 8-character samples have an 84% success rate, 16-character samples rise to 96%, and samples of 32 characters or longer produce no errors at all. Note that the final line prints the accuracy as a fraction: 0.9507 corresponds to 95.07% (1331 of 1400 tests).
PYTHONPATH=src/ python src/languedoc/train.py
# Source texts:
cs: dyk - krysař.txt (94122 chars)
cs: hašek - švejk.txt (679701 chars)
cs: poláček - hostinec u kamenného stolu.txt (434082 chars)
cs: vančura - konec starých časů.txt (418857 chars)
cs: čapek - apokryfy.txt (180550 chars)
de: 2188-8.txt (351444 chars)
de: 23396-0.txt (187138 chars)
de: 46896-8.txt (248309 chars)
de: pg10917.txt (384020 chars)
de: pg67409.txt (333959 chars)
en: fitzgerald - the great gatsby.txt (274897 chars)
en: joyce - ulysses.txt (1469511 chars)
en: lovecraft - the dunwich horror.txt (117020 chars)
en: orwell - 1984.txt (569311 chars)
en: woolf - mrs dalloway.txt (346294 chars)
es: 14307-8.txt (67263 chars)
es: 16670-8.txt (555534 chars)
es: 51019-0.txt (420431 chars)
es: 58484-8.txt (243137 chars)
es: 61189-8.txt (438149 chars)
fr: 44468-0.txt (360878 chars)
fr: 45176-0.txt (331838 chars)
fr: 64274-0.txt (108402 chars)
fr: pg68138.txt (380198 chars)
fr: pg68265.txt (368201 chars)
it: 22642-8.txt (467018 chars)
it: 28144-8.txt (285603 chars)
it: 39289-0.txt (472982 chars)
it: 49310-0.txt (295664 chars)
it: 57040-0.txt (382809 chars)
ru: Full text of История России Кириллов В.В Уч Пос 2007 661с ( 1) (1008316 chars)
ru: Full text of Каменев П. H. Сканави A. H. Богословский В. H. И Др. Часть 1. Отопление. 1975 (985388 chars)
ru: molier.txt (387588 chars)
# Crossvalidation:
cs misidentified as fr: trumm té
cs misidentified as fr: cit a l
cs misidentified as fr: en je ta
cs misidentified as en: l hostin
cs misidentified as fr: t a lidé
cs misidentified as it: nazaret
de misidentified as en: ch war f
de misidentified as it: r professor an e
de misidentified as en: en im hotel sond
de misidentified as es: so viel
en misidentified as de: j eckle
en misidentified as es: or prete
en misidentified as it: e stale
en misidentified as es: es agita
en misidentified as fr: connect
en misidentified as de: ust be after eig
en misidentified as fr: man pau
en misidentified as it: a puzzle
en misidentified as de: gs matte
en misidentified as de: ism and
es misidentified as en: se alarm
es misidentified as it: reserva
es misidentified as fr: ues de l
es misidentified as de: deber se
es misidentified as it: ronto la
es misidentified as fr: ue de le
es misidentified as it: no volv
es misidentified as it: sa perdi
es misidentified as fr: ailes de
es misidentified as it: e no lle
es misidentified as fr: luis sube un ra
es misidentified as it: ase a la iglesia
es misidentified as fr: t el rec
es misidentified as it: antonio
es misidentified as it: lama baltasar ti
es misidentified as fr: un galope que se
es misidentified as it: viduo un
es misidentified as fr: les con
es misidentified as it: escote ni desnud
es misidentified as fr: a encont
es misidentified as fr: vez vinu
es misidentified as en: se han
es misidentified as it: erpo de
es misidentified as fr: la caus
es misidentified as en: os bigot
es misidentified as fr: héroes balzaqui
es misidentified as fr: ue me marchara d
es misidentified as fr: va de génova voy
es misidentified as en: ol madrid me int
fr misidentified as es: le regar
fr misidentified as en: s inter
fr misidentified as it: de costa
fr misidentified as es: garde co
fr misidentified as es: recueill
fr misidentified as es: ale quel
fr misidentified as en: les offi
fr misidentified as it: e compar
fr misidentified as de: öhrenbac
fr misidentified as it: promena
fr misidentified as es: horizon
it misidentified as es: nica per
it misidentified as es: lla stan
it misidentified as es: uesto chablis ti
it misidentified as fr: onna cla
it misidentified as fr: va vent
it misidentified as fr: n si des
it misidentified as fr: endere le import
it misidentified as es: el tempo
it misidentified as es: a lo vid
Accuracy: 0.9507%, (1331/1400 tests during crossvalidation)
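The per-length success rates quoted above can be recomputed from the crossvalidation listing. The sketch below is not part of languedoc; the helper names are hypothetical. It assumes that samples may be printed with surrounding whitespace stripped, so each sample is assigned to the smallest test length that can hold it.

```python
import re
from collections import Counter

LENGTHS = (8, 16, 32, 64)
TESTS_PER_LENGTH = 7 * 5 * 10  # languages x models x tests = 350

# Matches lines such as "cs misidentified as fr: trumm té".
PATTERN = re.compile(r"^\w+ misidentified as \w+: (.*)$")


def errors_by_length(log_lines):
    """Count misidentified samples per test length."""
    counts = Counter()
    for line in log_lines:
        match = PATTERN.match(line.strip())
        if not match:
            continue  # skip source listings, headers and the accuracy line
        sample = match.group(1)
        # Smallest test length that can hold the (possibly stripped) sample.
        bucket = next((n for n in LENGTHS if len(sample) <= n), LENGTHS[-1])
        counts[bucket] += 1
    return counts


def success_rate(errors, total=TESTS_PER_LENGTH):
    """Fraction of tests that were identified correctly."""
    return (total - errors) / total
```

Tallying the listing above by hand gives 55 misidentified 8-character samples and 14 misidentified 16-character samples, 69 errors in total, matching 1400 - 1331. `success_rate(55)` is about 0.84 and `success_rate(14)` is 0.96, in line with the figures quoted in the text.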
Licensed under GNU GPLv3.