Languedoc Changeset - c0ac7e5989aa

tip default

Laman - 2024-11-26 22:16:11

added build-requirements.txt, updated readme

2 files changed with 8 insertions and 1 deletion:

README.md

build-requirements.txt

README.md

 Languedoc
 =========
 Language identification library based on ["N-Gram-Based Text Categorization"](https://www.cis.lmu.de/~stef/seminare/sprachenidentifizierung/cavnar_trenkle.pdf) by Cavnar and Trenkle.
 ## Usage
 1. Create a directory `data`, with a subdirectory for each target language. Fill with your training data.
 2. Run `PYTHONPATH=src/ python3 train.py`. It will create a `models.json.gz` file.
-3. You can now use:
+3. Build and install the package:
+```
+python3 -m build
+pip install dist/languedoc-...-py3-none-any.whl
+```
+4. You can now use:
 ```python
 import languedoc
 language = languedoc.identify("A text you want to identify.")
 ```
 It will output the identifier that you used as the subdirectory name in step 1, based on the closest match between n-gram frequencies.
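 The "closest match between n-gram frequencies" can be illustrated with a minimal sketch of the rank-based out-of-place measure described by Cavnar and Trenkle. This is not languedoc's actual implementation: the function names `ngram_profile`, `out_of_place` and `closest_language`, and the parameters `max_n` and `top`, are hypothetical and chosen only for illustration.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top=300):
    """Map the most frequent character n-grams of `text` to their rank (0 = most frequent).

    Hypothetical helper; `max_n` and `top` are illustrative defaults.
    """
    padded = f" {text.lower()} "
    counts = Counter(
        padded[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(padded) - n + 1)
    )
    # Sort by descending count, then alphabetically, so that ranks are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {gram: rank for rank, (gram, _) in enumerate(ranked[:top])}


def out_of_place(sample, model):
    """Sum the rank differences; n-grams missing from the model get a fixed maximum penalty."""
    max_penalty = len(model)
    return sum(
        abs(rank - model[gram]) if gram in model else max_penalty
        for gram, rank in sample.items()
    )


def closest_language(text, models):
    """Return the key of the model whose profile is closest to the profile of `text`."""
    sample = ngram_profile(text)
    return min(models, key=lambda lang: out_of_place(sample, models[lang]))
```

 Here `models` is assumed to be a dictionary mapping a language identifier (such as a subdirectory name) to a profile built with `ngram_profile` from that language's training text.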
 ## Accuracy
 Below is the training script output from my training data, from seven major European languages. Note that the cross-validation iterates through all languages, creates five models for each, and for each of those creates ten tests of every length: 8, 16, 32 and 64 characters. Counting the misidentified samples, the tiny 8-character samples have an 84% success rate, 16-character samples rise to 96%, and for 32 characters and longer there are no errors at all.
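 To put the percentages in perspective, the number of test samples can be worked out from the description above. The sketch below assumes that the ten tests are created per model and per length. The variable names and the `misidentified` helper are hypothetical, and because the quoted success rates are rounded, the derived error counts are only approximate.

```python
languages = 7            # seven major European languages
models_per_language = 5  # five models per language
tests_per_length = 10    # assumed: ten tests per model for every length
lengths = (8, 16, 32, 64)

samples_per_length = languages * models_per_language * tests_per_length
total_samples = samples_per_length * len(lengths)


def misidentified(success_rate, samples=samples_per_length):
    """Approximate number of misidentified samples for a rounded success rate."""
    return round(samples * (1 - success_rate))


print(samples_per_length, total_samples)  # 350 1400
print(misidentified(0.84))                # roughly 56 errors at 8 characters
print(misidentified(0.96))                # roughly 14 errors at 16 characters
```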
 ```

build-requirements.txt

		new file 100644
		build