A word-level Language Identification (LID) tool for Tagalog-English (Taglish) text
TagLID is a library that labels each word in a Taglish (mixed Tagalog-English)
text by language. It returns either a simple tag (tgl or eng) or detailed
frequency information with flags indicating how each word was identified. It is
a rule-based, opinionated system that relies mostly on dictionary lookups. It
also skips numbers, names, and interjections, and includes logic for slang,
abbreviations, contractions, stemming or lemmatizing inflected words,
intrawords, and correcting misspellings.
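As a rough illustration of dictionary-based word-level identification, here is a minimal sketch. It is not TagLID's implementation: the word lists, the toy_identify function, and the scoring rule are all hypothetical, and it covers only lookups and number skipping, none of the slang, stemming, or spelling-correction logic.

```python
import re

# Hypothetical toy word lists; a real system would load full dictionaries.
ENG_WORDS = {"hello", "what", "the"}
TGL_WORDS = {"mundo", "ano", "po", "lang"}


def toy_identify(text):
    """Label each word by looking it up in two tiny word lists."""
    results = []
    for word in re.findall(r"[A-Za-z'-]+|\d+", text.lower()):
        if word.isdigit():
            continue  # numbers carry no language signal, so skip them
        in_eng = word in ENG_WORDS
        in_tgl = word in TGL_WORDS
        total = in_eng + in_tgl
        # Split the score evenly when a word appears in both lists;
        # leave both scores at 0.0 when it appears in neither.
        eng = in_eng / total if total else 0.0
        tgl = in_tgl / total if total else 0.0
        results.append({"word": word, "eng": eng, "tgl": tgl})
    return results


print(toy_identify("hello, mundo 2024"))
```

The sketch only shows the general shape of a lookup-based labeler; the flags and corrections that TagLID reports are outside its scope.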
pip install git+https://github.com/andrianllmm/taglid.git@main
TagLID can be used as a standalone library, imported via import taglid, or as a
CLI application via python -m taglid.
Use the lid module for textual data.
Use lang_identify to identify each word in a text. It takes any string and
returns a list of words with their corresponding English and Tagalog values,
flag, and correction.
from taglid.lid import lang_identify
labeled_text = lang_identify("hello, mundo")
print(labeled_text)
Output:
[{'Word': 'hello', 'eng': 1.0, 'tgl': 0.0, 'Flag': 'DICT', 'Correction': None}, {'Word': 'mundo', 'eng': 0.0, 'tgl': 1.0, 'Flag': 'DICT', 'Correction': None}]
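Because the result is a plain list of dictionaries, it can be filtered with ordinary Python. The sketch below pulls out the words that lean toward one language. It assumes the key names shown in the printed output above ('Word', 'eng', 'tgl'); the words_in helper and its threshold parameter are hypothetical and not part of TagLID.

```python
# Sample copied from the output above; in practice this would be the
# return value of lang_identify("hello, mundo").
labeled_text = [
    {"Word": "hello", "eng": 1.0, "tgl": 0.0, "Flag": "DICT", "Correction": None},
    {"Word": "mundo", "eng": 0.0, "tgl": 1.0, "Flag": "DICT", "Correction": None},
]


def words_in(labeled, lang, threshold=0.5):
    """Return the words whose score for `lang` is strictly above `threshold`."""
    return [entry["Word"] for entry in labeled if entry[lang] > threshold]


print(words_in(labeled_text, "tgl"))
print(words_in(labeled_text, "eng"))
```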
Use tabulate to view the output in tabular format.
from tabulate import tabulate
print(tabulate(labeled_text, headers="keys"))
Output:
word      eng    tgl  flag    correction
------  -----  -----  ------  ------------
hello       1      0  DICT
mundo       0      1  DICT
Use simplify to show only the words and their language. It takes the return
value of lang_identify and returns a list of tuples, each containing a word and
its language.
from taglid.lid import simplify
simplified_text = simplify(labeled_text)
print(simplified_text)
Output:
[('hello', 'eng'), ('mundo', 'tgl')]
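The simplified list of (word, language) tuples is convenient for quick analyses of code-switching. As a minimal sketch, the hypothetical count_switches helper below (not part of TagLID) counts how often the language changes between adjacent words.

```python
# Sample copied from the output above; in practice this would be the
# return value of simplify(labeled_text).
simplified_text = [("hello", "eng"), ("mundo", "tgl")]


def count_switches(pairs):
    """Count positions where the language differs from the previous word's."""
    langs = [lang for _, lang in pairs]
    return sum(1 for prev, cur in zip(langs, langs[1:]) if prev != cur)


print(count_switches(simplified_text))
```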
Use the lid_dataset module for datasets.
Use lang_identify_df to label each word in each cell of a pandas DataFrame. It
takes a DataFrame of multiple rows and columns, with each cell containing
textual data, and returns a labeled DataFrame where each token is a row labeled
by its original row, original column, and token index.
import pandas as pd
from taglid.lid_dataset import lang_identify_df
data = [['hello po', 'ano?'], ['mag-aask lang po', 'what?']]
df = pd.DataFrame(data)
labeled_df = lang_identify_df(df)
print(labeled_df)
Output:
     col  token_index      word  eng  tgl  flag correction
row
0      0            1     hello  1.0  0.0  DICT       None
0      0            2        po  0.0  1.0  DICT       None
0      1            1       ano  0.0  1.0  FREQ       None
1      0            1  mag-aask  0.5  0.5  INTW       None
1      0            2      lang  0.0  1.0  FREQ       None
1      0            3        po  0.0  1.0  DICT       None
1      1            1      what  1.0  0.0  DICT       None
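To make the (row, col, token_index) labeling concrete, here is a minimal sketch of the flattening step in plain Python, without pandas or any language labeling. The flatten_cells function and its regex tokenizer are assumptions made for illustration; TagLID's own tokenization may differ.

```python
import re


def flatten_cells(rows):
    """Turn a grid of text cells into one record per token."""
    records = []
    for row_idx, row in enumerate(rows):
        for col_idx, cell in enumerate(row):
            # Naive tokenizer (an assumption): runs of word characters,
            # apostrophes, and hyphens; punctuation such as "?" is dropped.
            tokens = re.findall(r"[\w'-]+", cell)
            # Token indices start at 1, matching the output above.
            for token_idx, token in enumerate(tokens, start=1):
                records.append(
                    {"row": row_idx, "col": col_idx,
                     "token_index": token_idx, "word": token}
                )
    return records


data = [["hello po", "ano?"], ["mag-aask lang po", "what?"]]
for record in flatten_cells(data):
    print(record)
```

Each record corresponds to one row of the labeled DataFrame, minus the eng, tgl, flag, and correction columns.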
Run TagLID from the terminal.
python -m taglid.lid
Then type a sentence when prompted.
text: hello, mundo
Output:
word      eng    tgl  flag    correction
------  -----  -----  ------  ------------
hello       1      0  DICT
mundo       0      1  DICT
Add --simplify to show only the words and their language.
python -m taglid.lid --simplify --text "hello, mundo"
Output:
-----  ---
hello  eng
mundo  tgl
-----  ---
Use lid_dataset with Excel files to label spreadsheets directly.
python -m taglid.lid_dataset in_path out_path
TagLID's accuracy has not been evaluated yet.
Contributions are welcome! To get started:
- Fork the project
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a pull request
Found a bug or issue? Report it on the issues page.