Skip to content

Conversation

@alex-triphonov
Copy link

@alex-triphonov alex-triphonov commented Nov 12, 2020

I’ve found two main concurrent issues with dateparser:

I) TypeError: unsupported operand type(s) for -: 'NoneType' and 'relativedelta' #441 #813

Reason: Variable freshness_date_parser is a shared instance and its self.now was set to None by some threads while still being used by others.
Solution: Make a local variable now instead of self.now. There’s already an PR #814 with this solution, but unfortunately it stuck, so I had to add it to my code in order to pass tests.

II) Threads mix up date orders

Reason: Settings instances are shared between locales and while one thread in
_DateLocaleParser._try_parser() was parsing a date with date order picked for specific locale, other threads with the same Settings instances and different locales used shared _settings.DATE_ORDER before previous thread set default _settings.DATE_ORDER back, which is wrong for their locales and thus returned wrong dates.
Solution: Create separate Settings instances not only for different incoming settings, but for all other kwargs e.g. different locales, languages, etc.

other notes:
I noticed that the replace() method changed the original settings dict as it's a mutable object, so I’ve modified it to work with a copy() of it. As settings can be passed to kwargs as a None- added a check for this case.

@codecov
Copy link

codecov bot commented Nov 12, 2020

Codecov Report

Merging #834 (6ea11c2) into master (2d5290e) will increase coverage by 0.00%.
The diff coverage is 95.34%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #834   +/-   ##
=======================================
  Coverage   98.37%   98.38%           
=======================================
  Files         231      231           
  Lines        2591     2595    +4     
=======================================
+ Hits         2549     2553    +4     
  Misses         42       42           
Impacted Files Coverage Δ
dateparser/freshness_date_parser.py 98.03% <88.23%> (-0.06%) ⬇️
dateparser/conf.py 100.00% <100.00%> (ø)
dateparser/search/text_detection.py 98.14% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a45b23e...0da185b. Read the comment docs.

@Gallaecio
Copy link
Member

After ~1h playing around with the changes, I think I’m starting to get a grasp of some of it.

At the moment there are 2 questions in my mind:

  • Would it me possible to split the change in 2, “now” and “settings”? Or does it need to be in the same pull request due to the tests?
  • I wonder if we should not consider a bigger rewrite of how settings work. I feel like the logic we have there to prevent duplicate setting objects is basically being voided by these changes. And even if that is not the case, I wonder if the optimization of avoiding duplicate objects is worth it. Maybe we need to rethink Settings as a whole.
  • Even with big changes, I would strive for backward compatibility. In their current form, these changes make some API changes which I would like to avoid, if possible. Unless we want to post-pone these changes to dateparser 2.0.

@alex-triphonov
Copy link
Author

alex-triphonov commented Nov 20, 2020

@Gallaecio

  1. Yes, tests will fail without now PR, as there's an assert for any caught exceptions.
  2. Absolutely agree, I guess settings in existing form were made to use less instances of settings and as this doesn't work i.e. leads to concurrency issues, whole settings conception should be reconsidered, but this is's an epic..
  3. What backward compatibility exactly concerns you? I thought I made minimum possible changes, that most users shouldn't feel.

if settings:
self._updateall(settings.items())
else:
def __init__(self, **kwargs):
Copy link
Collaborator

@noviluni noviluni Nov 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Gallaecio @alex-triphonov
would it be possible to keep the settings=None and, in case it's used, raise a deprecation warning and preserve the old behavior?

Something like:

def __init__(self, settings=None, **kwargs):
    if settings:
        # raise warning (deprecation warning, not concurrent) + old behavior
    else:
        # new behavior using kwargs

I know the code won't be nice, but I think in that way we could keep the backward compatibility until the next major version.

(You should do the same for get_key() and maybe something similar in the other two methods you touched).

Maybe I'm saying something stupid, sorry guys, I didn't have too much time to play with the code.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No this is not possible- settings are still being used, but in kwargs for more convenience, the only way to separate behaviours is to add some new kwarg like new_settings but is old behaviour with fewer Settings instances that much essential?

Copy link
Member

@Gallaecio Gallaecio Nov 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me it’s mostly a matter of API. The current change would require to postpone this to dateparser 2.x, as it is backward incompatible. If we can figure a backward compatible implementation, this can go on the next dateparser version.

@noviluni
Copy link
Collaborator

@alex-triphonov you did a good job here 💪, but as mentioned before we can't merge it as-is.

The point is not "how essential it is", the issue here is that when declaring a public class or function (in this case Settings and replace()) in a library used by other people you can’t change the behavior without increasing the “major” version because some users could be using the Settings class in their way and this would be a breaking change for them.

However, fixing multithreading is a must and a priority, so I will check it further to see if I find a way to support both "settings" versions to avoid a really premature dateparser==2.0.0.

thanks!

@alex-triphonov
Copy link
Author

alex-triphonov commented Nov 25, 2020

Ok I get it. So in order to add backward incompatibility, we can add a new flag to settings, like SAFE_CONCURRENCY
and then separate behaviours like:

def __init__(self, **kwargs):
    if kwargs.get('SAFE_CONCURRENCY'):
        # new behavior using kwargs
    else:
        # raise warning (deprecation warning, not concurrent) + old behavior

This is relatively easy to implement, but then everyone who doesn't care about concurrency
will have to add this flag in order to get rid of deprecation warning..
So maybe warning should be avoided then. What do you think?

@noviluni
Copy link
Collaborator

Hi @alex-triphonov

Ok I get it. So in order to add backward incompatibility, we can add a new flag to settings, like SAFE_CONCURRENCY
and then separate behaviours like:

def __init__(self, **kwargs):
    if kwargs.get('SAFE_CONCURRENCY'):
        # new behavior using kwargs
    else:
        # raise warning (deprecation warning, not concurrent) + old behavior

This is relatively easy to implement, but then everyone who doesn't care about concurrency
will have to add this flag in order to get rid of deprecation warning..
So maybe warning should be avoided then. What do you think?

Sorry, I think I would need to add some tests to fully understand if these changes related to settings creation are breaking changes or not.

But for the replace method it seems obvious, as we needed to change the tests, I would like this: settings = settings.replace(NORMALIZE=False) and this settings = settings.replace(settings={'NORMALIZE': False}) to work in the same way.

Couldn't be possible to do something like this?:

 def replace(self, mod_settings=None, **kwds):
     if any(kwarg for kwarg in kwds if kwarg not in ('settings')):
         # warning + old behavior
     else:
         # new behavior

@alex-triphonov
Copy link
Author

alex-triphonov commented Nov 30, 2020

But for the replace method it seems obvious, as we needed to change the tests, I would like this: settings = settings.replace(NORMALIZE=False) and this settings = settings.replace(settings={'NORMALIZE': False}) to work in the same way.

Couldn't be possible to do something like this?:

 def replace(self, mod_settings=None, **kwds):
     if any(kwarg for kwarg in kwds if kwarg not in ('settings')):
         # warning + old behavior
     else:
         # new behavior

No I don't think we can do it this way because we have kwargs besides settings: locales, date_formats etc..
Unless we add another check if this kwarg is supported by Settings..
I can add some additional tests if you describe your concerns.

@noviluni noviluni added this to the 2.0.0 milestone Nov 30, 2020
@alex-triphonov
Copy link
Author

alex-triphonov commented Dec 2, 2020

@noviluni added this to the 2.0.0 milestone 2 days ago

Does it mean adding different behaviour for Settings when using special key in it is not backwards compatible enough and we'll not see concurrency safety in near future?

@noviluni
Copy link
Collaborator

noviluni commented Dec 2, 2020

Hey @alex-triphonov

no, that milestone means that, as of now, the PR couldn't be merged as-is before a dateparser==2.0.0 version. Of course, if we fix it, we can merge it before.

It's ok to use the SAFE_CONCURRENCY as a temporary workaround, you can go ahead and implement it :)

@alex-triphonov
Copy link
Author

alex-triphonov commented Jan 12, 2021

@noviluni
So I returned to the topic and checked the possibility to add SAFE_CONCURRENCY flag to settings, and I need to clarifu something.

  1. The only change to Settings behaviour is creating separate Settings instances not only for different incoming settings, but for all other kwargs e.g. different locales, languages, etc. Do we want to separate it from old behaviour? because then we just should check this flag in get_key method and don't make extra instances.
  2. But for the replace method it seems obvious, as we needed to change the tests, I would like this: settings = settings.replace(NORMALIZE=False) and this settings = settings.replace(settings={'NORMALIZE': False}) to work in the same way.
    I can do it by moving settings_values from check_settings function and check kwargs in replace against it's keys. is it ok?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants