Merged
3 changes: 1 addition & 2 deletions Jenkinsfile
Original file line number Diff line number Diff line change
@@ -10,7 +10,6 @@ pipeline {
disableConcurrentBuilds(abortPrevious: true)
}
environment {

AR_TN_CACHE='/home/jenkins/TestData/text_norm/ci/grammars/04-24-24-0'
DE_TN_CACHE='/home/jenkins/TestData/text_norm/ci/grammars/10-23-24-0'
EN_TN_CACHE='/home/jenkins/TestData/text_norm/ci/grammars/09-25-25-0'
@@ -27,7 +26,7 @@ pipeline {
HY_TN_CACHE='/home/jenkins/TestData/text_norm/ci/grammars/03-12-24-0'
MR_TN_CACHE='/home/jenkins/TestData/text_norm/ci/grammars/03-12-24-1'
JA_TN_CACHE='/home/jenkins/TestData/text_norm/ci/grammars/10-17-24-1'
HI_TN_CACHE='/home/jenkins/TestData/text_norm/ci/grammars/04-22-25-0'
HI_TN_CACHE='/home/jenkins/TestData/text_norm/ci/grammars/10-31-25-0'
DEFAULT_TN_CACHE='/home/jenkins/TestData/text_norm/ci/grammars/06-08-23-0'
}
stages {
@@ -0,0 +1,5 @@
हफ़्ते
सप्ताह
सदियां
सदियों

@@ -4,9 +4,8 @@ h घंटे
min मिनट
doz दर्जन
yr साल
yr वर्ष
hp हॉर्सपॉवर
d दिन
month महीना
months महीने
हफ़्ते हफ़्ते

@@ -134,7 +134,6 @@ KHz किलोहर्ट्ज़
N न्यूटन
dB डेसीबल
yr साल
yr वर्ष
Member
Why the deletion?

Contributor
Duplicate mapping: `yr` had two different targets (साल and वर्ष).

Member
You can have multiple targets; just add a guard for non-determinism.

Contributor
Will keep this in mind. I went through the code, and currently it's just a simple mapping. The two targets are also just formal and informal variants, so for now we're scoping this to a single, deterministic output.

Member
Is it a common term? If so it should be added nonetheless

Collaborator Author
I think it's ok to leave it for now with one single output, especially since we've already merged. Let's revisit if/when we go for nondet?

Member
I'm unsure when we'll next have dedicated help for Hindi, so I want to make sure the necessary stuff is added. If it's not common then we can remove it, but if leaving it out is going to make outputs funky, it should be added.

Contributor
साल (informal) is much more frequent (85%), while वर्ष (formal) is generally used in scientific/religious texts. In speech the formal one is even rarer.

Coming back to our use case, 10 of the 455 blind tests contain `yr`, and they all have this format: <number> yr

The only way that I can think of to differentiate is to make it formal if the number is large or a decimal. If that sounds good, I'll add this change in one of my future PRs.

Member
I think that's a good call. @mgrafu do you concur?
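
For reference, the two options discussed in this thread could look roughly like the sketch below. It is plain Python, not the pynini grammar from the repository, and `YEAR_VARIANTS`, `expand_yr`, `pick_year_word`, and the `LARGE_THRESHOLD` cut-off are hypothetical names and values chosen only for illustration; the thread does not define what counts as "large".

```python
from typing import List

# Hypothetical sketch of the ideas from the review thread (not repository code).
YEAR_VARIANTS = ["साल", "वर्ष"]  # informal variant first, formal variant second

LARGE_THRESHOLD = 100  # assumed cut-off for a "large" number; not from the PR


def expand_yr(deterministic: bool = True) -> List[str]:
    """Return the candidate expansions of `yr`.

    Deterministic mode keeps a single output; non-deterministic mode
    (the "guard" mentioned in the thread) exposes both variants.
    """
    return YEAR_VARIANTS[:1] if deterministic else list(YEAR_VARIANTS)


def pick_year_word(number_text: str) -> str:
    """Proposed heuristic: use the formal word for decimals or large numbers."""
    if "." in number_text:
        return YEAR_VARIANTS[1]
    if int(number_text) >= LARGE_THRESHOLD:
        return YEAR_VARIANTS[1]
    return YEAR_VARIANTS[0]
```

With these assumptions, `pick_year_word("5")` selects the informal word, while `pick_year_word("2.5")` and `pick_year_word("1500")` select the formal one.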

hp हॉर्सपॉवर
d दिन
month महीना
@@ -0,0 +1,13 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
@@ -0,0 +1,12 @@
१ला पहला
१ली पहली
२रा दूसरा
२री दूसरी
३रा तीसरा
३री तीसरी
४था चौथा
४थी चौथी
५वां पाँचवां
५वीं पाँचवीं
६ठा छठा
६ठी छठी
@@ -0,0 +1,3 @@
वां
वीं
वें
@@ -0,0 +1,2 @@
वे वें

@@ -0,0 +1,13 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
@@ -0,0 +1,3 @@
नंबर
कार्ड
क्रेडिट
@@ -0,0 +1,5 @@
नंबर
मोबाइल
फोन
लैंडलाइन
कॉल
@@ -0,0 +1,4 @@
नंबर
मोबाइल
फोन
कॉल
@@ -0,0 +1,10 @@
0 शून्य
1 एक
2 दो
3 तीन
4 चार
5 पाँच
6 छह
7 सात
8 आठ
9 नौ
@@ -0,0 +1,4 @@
नंबर
पिन
कोड
पिनकोड
@@ -0,0 +1,100 @@
० एक
१ दो
२ तीन
३ चार
४ पाँच
५ छह
६ सात
७ आठ
८ नौ
९ दस
१० ग्यारह
११ बारह
१२ तेरह
१३ चौदह
१४ पंद्रह
१५ सोलह
१६ सत्रह
१७ अठारह
१८ उन्नीस
१९ बीस
२० इक्कीस
२१ बाईस
२२ तेईस
२३ चौबीस
२४ पच्चीस
२५ छब्बीस
२६ सत्ताईस
२७ अट्ठाईस
२८ उनतीस
२९ तीस
३० इकतीस
३१ बत्तीस
३२ तैंतीस
३३ चौंतीस
३४ पैंतीस
३५ छत्तीस
३६ सैंतीस
३७ अड़तीस
३८ उनतालीस
३९ चालीस
४० इकतालीस
४१ बयालीस
४२ तैंतालीस
४३ चौवालीस
४४ पैंतालीस
४५ छियालीस
४६ सैंतालीस
४७ अड़तालीस
४८ उनचास
४९ पचास
५० इक्यावन
५१ बावन
५२ तिरेपन
५३ चौवन
५४ पचपन
५५ छप्पन
५६ सत्तावन
५७ अट्ठावन
५८ उनसठ
५९ साठ
६० इकसठ
६१ बासठ
६२ तिरेसठ
६३ चौंसठ
६४ पैंसठ
६५ छियासठ
६६ सड़सठ
६७ अड़सठ
६८ उनहत्तर
६९ सत्तर
७० इकहत्तर
७१ बहत्तर
७२ तिहत्तर
७३ चौहत्तर
७४ पचहत्तर
७५ छिहत्तर
७६ सतहत्तर
७७ अठहत्तर
७८ उनासी
७९ अस्सी
८० इक्यासी
८१ बयासी
८२ तिरासी
८३ चौरासी
८४ पचासी
८५ छियासी
८६ सत्तासी
८७ अट्ठासी
८८ नवासी
८९ नब्बे
९० इक्यानबे
९१ बानबे
९२ तिरानबे
९३ चौरानबे
९४ पंचानबे
९५ छियानबे
९६ सत्तानबे
९७ अट्ठानबे
९८ निन्यानबे
९९ एक सौ
7 changes: 7 additions & 0 deletions nemo_text_processing/text_normalization/hi/graph_utils.py
@@ -30,6 +30,13 @@
NEMO_HI_DIGIT = pynini.union("०", "१", "२", "३", "४", "५", "६", "७", "८", "९").optimize()
NEMO_HI_NON_ZERO = pynini.union("१", "२", "३", "४", "५", "६", "७", "८", "९").optimize()
NEMO_HI_ZERO = "०"

HI_DEDH = "डेढ़" # 1.5
HI_DHAI = "ढाई" # 2.5
HI_SAVVA = "सवा" # quarter more (1.25)
HI_SADHE = "साढ़े" # half more (X.5)
HI_PAUNE = "पौने" # quarter less (0.75)

NEMO_LOWER = pynini.union(*string.ascii_lowercase).optimize()
NEMO_UPPER = pynini.union(*string.ascii_uppercase).optimize()
NEMO_ALPHA = pynini.union(NEMO_LOWER, NEMO_UPPER).optimize()
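
The comments on the new `HI_*` constants describe quarter and half offsets. A minimal plain-Python sketch of that arithmetic follows, assuming the usual Hindi convention that सवा and साढ़े add to the integer that follows them while पौने subtracts a quarter from it. The function `quarter_form` and its return shape are hypothetical; the diff only defines the constants and does not show how the grammars consume them.

```python
from fractions import Fraction

# Constants copied from the diff above.
HI_DEDH = "डेढ़"    # 1.5
HI_DHAI = "ढाई"    # 2.5
HI_SAVVA = "सवा"   # quarter more
HI_SADHE = "साढ़े"  # half more
HI_PAUNE = "पौने"   # quarter less


def quarter_form(value: Fraction):
    """Return (prefix, integer_to_read) for a non-negative value, or None.

    Hypothetical helper: integer_to_read is the whole number spoken after
    the prefix, or None when the prefix alone names the value.
    """
    whole, frac = divmod(value, 1)
    whole = int(whole)
    if value == Fraction(3, 2):
        return (HI_DEDH, None)
    if value == Fraction(5, 2):
        return (HI_DHAI, None)
    if frac == Fraction(1, 4) and whole >= 1:
        return (HI_SAVVA, whole)       # e.g. 2.25 -> prefix + 2
    if frac == Fraction(1, 2) and whole >= 3:
        return (HI_SADHE, whole)       # e.g. 3.5 -> prefix + 3
    if frac == Fraction(3, 4):
        return (HI_PAUNE, whole + 1)   # e.g. 1.75 -> prefix + 2
    return None
```

Note that the पौने case reads the next integer up, so 1.75 pairs the prefix with 2 rather than 1.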
25 changes: 18 additions & 7 deletions nemo_text_processing/text_normalization/hi/taggers/cardinal.py
@@ -15,18 +15,18 @@
import pynini
from pynini.lib import pynutil

from nemo_text_processing.text_normalization.hi.graph_utils import GraphFst
from nemo_text_processing.text_normalization.hi.graph_utils import GraphFst, insert_space
from nemo_text_processing.text_normalization.hi.utils import get_abs_path


class CardinalFst(GraphFst):
"""
Finite state transducer for classifying cardinals, e.g.
-२३ -> cardinal { negative: "true" integer: "तेइस" } }
s
Args:
deterministic: if True will provide a single transduction option,
for False multiple transduction are generated (used for audio-based normalization)
Finite state transducer for classifying cardinals, e.g.
-२३ -> cardinal { negative: "true" integer: "तेइस" }

Args:
deterministic: if True will provide a single transduction option,
for False multiple transduction are generated (used for audio-based normalization)
"""

def __init__(self, deterministic: bool = True, lm: bool = False):
@@ -37,6 +37,10 @@ def __init__(self, deterministic: bool = True, lm: bool = False):
teens_ties = pynini.string_file(get_abs_path("data/numbers/teens_and_ties.tsv"))
teens_and_ties = pynutil.add_weight(teens_ties, -0.1)

self.digit = digit
self.zero = zero
self.teens_and_ties = teens_and_ties

def create_graph_suffix(digit_graph, suffix, zeros_counts):
zero = pynutil.add_weight(pynutil.delete("०"), -0.1)
if zeros_counts == 0:
@@ -294,6 +298,12 @@ def create_larger_number_graph(digit_graph, suffix, zeros_counts, sub_graph):
graph_ten_shankhs |= create_larger_number_graph(teens_and_ties, suffix_shankhs, 0, graph_ten_padmas)
graph_ten_shankhs.optimize()

# Only match exactly 2 digits to avoid interfering with telephone numbers, decimals, etc.
# e.g., "०५" -> "शून्य पाँच"
single_digit = digit | zero
graph_leading_zero = zero + insert_space + single_digit
graph_leading_zero = pynutil.add_weight(graph_leading_zero, 0.5)

final_graph = (
digit
| zero
@@ -315,6 +325,7 @@ def create_larger_number_graph(digit_graph, suffix, zeros_counts, sub_graph):
| graph_ten_padmas
| graph_shankhs
| graph_ten_shankhs
| graph_leading_zero
)

optional_minus_graph = pynini.closure(pynutil.insert("negative: ") + pynini.cross("-", "\"true\" "), 0, 1)
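
The new `graph_leading_zero` branch accepts a zero followed by exactly one more digit and reads both digits separately. A plain-Python paraphrase of what that branch accepts is sketched below. It is not the pynini code: `HI_DIGIT_WORDS` and `read_leading_zero_pair` are hypothetical names, and the sketch ignores the `0.5` weight that makes this branch less preferred than the other cardinal paths.

```python
# Devanagari digit -> Hindi word, using the spellings from the digit table in this PR.
HI_DIGIT_WORDS = {
    "०": "शून्य",
    "१": "एक",
    "२": "दो",
    "३": "तीन",
    "४": "चार",
    "५": "पाँच",
    "६": "छह",
    "७": "सात",
    "८": "आठ",
    "९": "नौ",
}


def read_leading_zero_pair(text: str):
    """Mimic `zero + insert_space + single_digit` for a two-character input.

    Returns the spoken form, or None when the input is not a zero followed
    by exactly one digit (so longer strings such as phone numbers are left
    to other branches).
    """
    if len(text) != 2 or text[0] != "०" or text[1] not in HI_DIGIT_WORDS:
        return None
    return HI_DIGIT_WORDS["०"] + " " + HI_DIGIT_WORDS[text[1]]


print(read_leading_zero_pair("०५"))  # शून्य पाँच
```

Inputs such as `"००५"` or `"१५"` return `None` here, mirroring the "exactly 2 digits" restriction in the code comment.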
15 changes: 6 additions & 9 deletions nemo_text_processing/text_normalization/hi/taggers/date.py
@@ -65,11 +65,11 @@ def __init__(self, cardinal: GraphFst):
(NEMO_HI_DIGIT + NEMO_HI_NON_ZERO + NEMO_HI_DIGIT + NEMO_HI_DIGIT), cardinal.graph_hundreds_as_thousand
)

cardinal_graph = (
digit | teens_and_ties | cardinal.graph_hundreds | graph_year_thousands | graph_year_hundreds_as_thousands
cardinal_graph = pynini.union(
digit, teens_and_ties, cardinal.graph_hundreds, graph_year_thousands, graph_year_hundreds_as_thousands
)

graph_year = graph_year_thousands | graph_year_hundreds_as_thousands
graph_year = pynini.union(graph_year_thousands, graph_year_hundreds_as_thousands)

delete_dash = pynutil.delete("-")
delete_slash = pynutil.delete("/")
@@ -102,13 +102,10 @@ def __init__(self, cardinal: GraphFst):
# Updated logic to use prefix_union
year_prefix = pynutil.insert("era: \"") + prefix_union + insert_space + graph_year + pynutil.insert("\"")

graph_dd_mm_yyyy = (
days_graph + (delete_dash | delete_slash) + months_graph + (delete_dash | delete_slash) + years_graph
)
delete_separator = pynini.union(delete_dash, delete_slash)
graph_dd_mm_yyyy = days_graph + delete_separator + months_graph + delete_separator + years_graph

graph_mm_dd_yyyy = (
months_graph + (delete_dash | delete_slash) + days_graph + (delete_dash | delete_slash) + years_graph
)
graph_mm_dd_yyyy = months_graph + delete_separator + days_graph + delete_separator + years_graph

graph_mm_dd_yyyy += pynutil.insert(" preserve_order: true ")
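
Because `delete_separator` is concatenated twice and each occurrence chooses between dash and slash independently, the date graphs accept mixed separators both before and after this refactor. A rough regular-expression analogue of the separator handling is sketched below. The field widths, `DATE_RE`, and `split_date` are assumptions made for illustration, since `days_graph`, `months_graph`, and `years_graph` are not shown in this hunk.

```python
import re

# Analogue of: field + delete_separator + field + delete_separator + field
# Field widths are assumed; the real graphs are defined elsewhere in date.py.
DATE_RE = re.compile(r"^([०-९]{1,2})[-/]([०-९]{1,2})[-/]([०-९]{4})$")


def split_date(text: str):
    """Return the three numeric fields with separators removed, or None."""
    match = DATE_RE.match(text)
    return match.groups() if match else None


# Each separator position is matched independently, so mixing is allowed.
assert split_date("०१-०२-२०२४") == split_date("०१-०२/२०२४")
```

If mixed separators were ever undesirable, the union would need to be split into one dash-only and one slash-only date pattern.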

@@ -58,9 +58,7 @@ class DecimalFst(GraphFst):
def __init__(self, cardinal: GraphFst, deterministic: bool = True):
super().__init__(name="decimal", kind="classify", deterministic=deterministic)

graph_digit = pynini.string_file(get_abs_path("data/numbers/digit.tsv"))
graph_digit |= pynini.string_file(get_abs_path("data/numbers/zero.tsv"))

graph_digit = cardinal.digit | cardinal.zero
cardinal_graph = cardinal.final_graph

self.graph = graph_digit + pynini.closure(insert_space + graph_digit).optimize()
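
The expression `graph_digit + pynini.closure(insert_space + graph_digit)` accepts one or more digits and emits one word per digit, separated by spaces. A plain-Python equivalent of that digit-by-digit reading is sketched below; `HI_DIGIT_WORDS` and `read_digits` are hypothetical names, and it is an assumption here that this graph is used for the digits after the decimal point, since the hunk does not show where `self.graph` is consumed.

```python
# Devanagari digit -> Hindi word, using the spellings from the digit table in this PR.
HI_DIGIT_WORDS = {
    "०": "शून्य",
    "१": "एक",
    "२": "दो",
    "३": "तीन",
    "४": "चार",
    "५": "पाँच",
    "६": "छह",
    "७": "सात",
    "८": "आठ",
    "९": "नौ",
}


def read_digits(digits: str) -> str:
    """Read a digit string one digit at a time, space separated.

    Like the FST, this requires at least one digit; a non-digit character
    raises KeyError, which stands in for the FST rejecting the input.
    """
    if not digits:
        raise ValueError("at least one digit is required")
    return " ".join(HI_DIGIT_WORDS[ch] for ch in digits)


print(read_digits("२५"))  # दो पाँच
```

Reusing `cardinal.digit | cardinal.zero` instead of reloading the TSV files keeps the digit inventory defined in one place.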