Changeset - c0ac7e5989aa
tip default
Laman - 5 months ago 2024-11-26 22:16:11

added build-requirements.txt, updated readme
2 files changed with 8 insertions and 1 deletions:
README.md
 
Languedoc
=========
 

	
 
Language identification library based on ["N-Gram-Based Text Categorization"](https://www.cis.lmu.de/~stef/seminare/sprachenidentifizierung/cavnar_trenkle.pdf) by Cavnar and Trenkle.
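As a rough illustration of the rank-order ("out-of-place") comparison described in the Cavnar and Trenkle paper, here is a minimal sketch. It is not languedoc's actual code: the function names, the n-gram length limit `max_n=3` and the profile size `top=300` are assumptions chosen for the example, and the two toy "models" are built from single sentences rather than real training data.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top=300):
    """Return the most frequent character n-grams of `text`, ordered by rank."""
    # Normalize whitespace and pad with spaces so word boundaries yield n-grams too.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(
        padded[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(padded) - n + 1)
    )
    # Sort by descending count; ties are broken alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(sample_profile, model_profile):
    """Sum of rank differences; n-grams missing from the model get the maximum penalty."""
    model_rank = {gram: rank for rank, gram in enumerate(model_profile)}
    max_penalty = len(model_profile)
    return sum(
        abs(rank - model_rank[gram]) if gram in model_rank else max_penalty
        for rank, gram in enumerate(sample_profile)
    )


def closest_language(text, models):
    """Pick the language whose profile has the smallest out-of-place distance."""
    sample = ngram_profile(text)
    return min(models, key=lambda lang: out_of_place(sample, models[lang]))


# Toy models built from one sentence each (hypothetical data, for illustration only).
models = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog and the cat"),
    "de": ngram_profile("der schnelle braune fuchs springt über den faulen hund"),
}
print(closest_language("the dog and the fox", models))
```

Real profiles need far more text than this; with single-sentence models the result on short inputs is not reliable.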
 

	
 
## Usage
 
1. Create a directory `data` with one subdirectory per target language, and fill each subdirectory with training texts in that language.
 

	
 
2. Run `PYTHONPATH=src/ python3 src/languedoc/train.py`. It will create a `models.json.gz` file.
 

	
 
3. Build and install the package:
 
```
python3 -m build
pip install dist/languedoc-...-py3-none-any.whl
```
 

	
 
4. You can now use:
 
```python
import languedoc

language = languedoc.identify("A text you want to identify.")
```
 
It returns the identifier that you used as the subdirectory name in step 1, chosen as the closest match between n-gram frequencies.
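Before training, it can help to check that the `data` directory from step 1 has the expected layout. The sketch below is a hypothetical helper, not part of languedoc: `list_source_texts` is an invented name, and it assumes the training files are UTF-8 text. It walks `data/<language>/` and reports each file with its character count, in the same `language: file (N chars)` shape as the training script's source listing.

```python
from pathlib import Path


def list_source_texts(data_dir="data"):
    """Yield (language, file name, character count) for every file in data/<language>/."""
    for lang_dir in sorted(Path(data_dir).iterdir()):
        if not lang_dir.is_dir():
            continue  # only subdirectories represent languages
        for path in sorted(lang_dir.iterdir()):
            if path.is_file():
                # Assumption: training texts are UTF-8 encoded.
                text = path.read_text(encoding="utf-8")
                yield lang_dir.name, path.name, len(text)


if __name__ == "__main__" and Path("data").is_dir():
    for lang, name, size in list_source_texts():
        print(f"{lang}: {name} ({size} chars)")
```

The subdirectory names reported here are the identifiers that `languedoc.identify` will later return.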
 

	
 
## Accuracy
 

	
 
Below is the training script output for my training data, which covers seven major European languages. The cross-validation iterates over all languages; for each language it builds five models, and for each model it runs ten tests at each sample length of 8, 16, 32 and 64 characters. Counting the misidentified samples, the tiny 8-character samples are identified correctly 84% of the time, 16-character samples 96% of the time, and samples of 32 characters or longer are identified without any errors.
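As a back-of-the-envelope check of the test counts described above, the following sketch multiplies out the cross-validation scheme. It assumes exactly five models for every one of the seven languages and that each success rate was computed over all tests of a given length; the implied error counts are derived from the quoted percentages and are not taken from the log.

```python
languages = 7              # seven languages in the training data
models_per_language = 5    # assumption: five models for every language
tests_per_model = 10       # ten tests per model at each sample length
lengths = (8, 16, 32, 64)  # sample lengths in characters

tests_per_length = languages * models_per_language * tests_per_model
total_tests = tests_per_length * len(lengths)


def implied_errors(success_percent, tests=tests_per_length):
    """Number of misidentified samples implied by a whole-number success rate."""
    return tests * (100 - success_percent) // 100


print(tests_per_length, total_tests)
print(implied_errors(84), implied_errors(96), implied_errors(100))
```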
 

	
 
```
PYTHONPATH=src/ python src/languedoc/train.py
# Source texts:
cs: dyk - krysař.txt (94122 chars)
cs: hašek - švejk.txt (679701 chars)
cs: poláček - hostinec u kamenného stolu.txt (434082 chars)
cs: vančura - konec starých časů.txt (418857 chars)
cs: čapek - apokryfy.txt (180550 chars)
de: 2188-8.txt (351444 chars)
de: 23396-0.txt (187138 chars)
de: 46896-8.txt (248309 chars)
de: pg10917.txt (384020 chars)
de: pg67409.txt (333959 chars)
en: fitzgerald - the great gatsby.txt (274897 chars)
en: joyce - ulysses.txt (1469511 chars)
en: lovecraft - the dunwich horror.txt (117020 chars)
en: orwell - 1984.txt (569311 chars)
en: woolf - mrs dalloway.txt (346294 chars)
es: 14307-8.txt (67263 chars)
es: 16670-8.txt (555534 chars)
es: 51019-0.txt (420431 chars)
es: 58484-8.txt (243137 chars)
es: 61189-8.txt (438149 chars)
fr: 44468-0.txt (360878 chars)
fr: 45176-0.txt (331838 chars)
fr: 64274-0.txt (108402 chars)
fr: pg68138.txt (380198 chars)
fr: pg68265.txt (368201 chars)
it: 22642-8.txt (467018 chars)
it: 28144-8.txt (285603 chars)
it: 39289-0.txt (472982 chars)
it: 49310-0.txt (295664 chars)
it: 57040-0.txt (382809 chars)
ru: Full text of История России Кириллов В.В Уч Пос 2007 661с ( 1) (1008316 chars)
ru: Full text of Каменев П. H. Сканави A. H. Богословский В. H. И Др. Часть 1. Отопление. 1975 (985388 chars)
ru: molier.txt (387588 chars)
 

	
build-requirements.txt
 
new file 100644
 
build