Skip to content

Add direct syslog logging per workflow/task #340

@jbbarth

Description

@jbbarth

Today simpleflow only logs to stdout and with little contextual information on the performed actions. It falls flat when trying to identify logs from a specific workflow or activity at a given point in time, given that machines often perform different tasks simultaneously.

I'd like to add a second handler, optional, to a remote syslog target (think rsyslog remote server, or a hosted service such as logmatic or papertrail).

It would log things in the following format (please criticize):

# for tasks performed "inside" a workflow
<workflow_id>: <activity|decision>#<started_event_id> <log level> pid=<pid> <message>

# for tasks performed "outside" a workflow (workers initialization, etc.)
<message>
# or (tell me?)
simpleflow: <message>

The result would look like this in papertrail or your favorite syslog server:

May 14 10:22:19 127.0.0.1 logger: INFO pid=3817 using task list basic-bcb8a2b0092072ef-ca7d812541f544fe964ded26ae5cc8db 
May 14 10:22:19 127.0.0.1 logger: INFO pid=3823 starting <bound method ActivityPoller.start of ActivityPoller(domain=TestDomain, task_list=basic-bcb8a2b0092072ef-ca7d812541f544fe964ded26ae5cc8db)> 
May 14 10:22:19 127.0.0.1 logger: INFO pid=3822 starting <bound method DeciderPoller.start of DeciderPoller(TestDomain, basic-bcb8a2b0092072ef-ca7d812541f544fe964ded26ae5cc8db, basic)> 
May 14 10:22:19 127.0.0.1 logger: INFO pid=3824 starting ActivityPoller(task_list=basic-bcb8a2b0092072ef-ca7d812541f544fe964ded26ae5cc8db) on domain TestDomain 
May 14 10:22:19 127.0.0.1 logger: INFO pid=3825 starting DeciderPoller(workflow=basic) on domain TestDomain 
May 14 10:22:19 127.0.0.1 basic-bcb8a2b0092072ef: decision#3 INFO pid=3826 taking decision for workflow basic-bcb8a2b0092072ef (basic) 
May 14 10:22:19 127.0.0.1 basic-bcb8a2b0092072ef: decision#3 INFO pid=3826 execution context from decider: {'workflow_id': u'basic-bcb8a2b0092072ef', 'version': u'example', 'name': u'basic', 'run_id': u'22ukk/DSnnnnnnnnnS5OR7mOxdUg1Y=', 'parent_workflow_id': None, 'tag_list': [u'a=1', u'b=foo'], 'parent_run_id': None, 'domain_name': u'TestDomain', 'continued_execution_run_id': None} 
May 14 10:22:19 127.0.0.1 basic-bcb8a2b0092072ef: decision#3 INFO pid=3826 1 open activities (1 decisions) 
May 14 10:22:19 127.0.0.1 basic-bcb8a2b0092072ef: decision#3 INFO pid=3826 completing decision for workflow basic-bcb8a2b0092072ef (basic) 
May 14 10:22:19 127.0.0.1 basic-bcb8a2b0092072ef: activity#6 INFO pid=3827 activity context: {'activity_id': u'activity-examples.basic.increment-1', 'name': u'examples.basic.increment', 'run_id': u'22ukk/DSnnnnnnnnnS5OR7mOxdUg1Y=', 'domain_name': u'TestDomain', 'workflow_id': u'basic-bcb8a2b0092072ef', 'version': u'example', 'input': u'{"args":[1],"kwargs":{}}'} 
May 14 10:22:19 127.0.0.1 basic-bcb8a2b0092072ef: decision#9 INFO pid=3828 taking decision for workflow basic-bcb8a2b0092072ef (basic) 
May 14 10:22:19 127.0.0.1 basic-bcb8a2b0092072ef: decision#9 INFO pid=3828 execution context from decider: {'workflow_id': u'basic-bcb8a2b0092072ef', 'version': u'example', 'name': u'basic', 'run_id': u'22ukk/DSnnnnnnnnnS5OR7mOxdUg1Y=', 'parent_workflow_id': None, 'tag_list': [u'a=1', u'b=foo'], 'parent_run_id': None, 'domain_name': u'TestDomain', 'continued_execution_run_id': None} 
May 14 10:22:20 127.0.0.1 basic-bcb8a2b0092072ef: decision#9 INFO pid=3828 2 open activities (2 decisions) 
May 14 10:22:20 127.0.0.1 basic-bcb8a2b0092072ef: decision#9 INFO pid=3828 completing decision for workflow basic-bcb8a2b0092072ef (basic) 
May 14 10:22:20 127.0.0.1 basic-bcb8a2b0092072ef: decision#16 INFO pid=3831 taking decision for workflow basic-bcb8a2b0092072ef (basic) 
May 14 10:22:20 127.0.0.1 basic-bcb8a2b0092072ef: decision#16 INFO pid=3831 execution context from decider: {'workflow_id': u'basic-bcb8a2b0092072ef', 'version': u'example', 'name': u'basic', 'run_id': u'22ukk/DSnnnnnnnnnS5OR7mOxdUg1Y=', 'parent_workflow_id': None, 'tag_list': [u'a=1', u'b=foo'], 'parent_run_id': None, 'domain_name': u'TestDomain', 'continued_execution_run_id': None} 
May 14 10:22:20 127.0.0.1 basic-bcb8a2b0092072ef: decision#16 INFO pid=3831 result of (1 + 1) * 2 = 4 
May 14 10:22:20 127.0.0.1 basic-bcb8a2b0092072ef: decision#16 INFO pid=3831 1 open activities (0 decisions) 
May 14 10:22:20 127.0.0.1 basic-bcb8a2b0092072ef: decision#16 INFO pid=3831 completing decision for workflow basic-bcb8a2b0092072ef (basic) 
May 14 10:22:50 127.0.0.1 basic-bcb8a2b0092072ef: decision#21 INFO pid=3843 taking decision for workflow basic-bcb8a2b0092072ef (basic) 
May 14 10:22:50 127.0.0.1 basic-bcb8a2b0092072ef: decision#21 INFO pid=3843 execution context from decider: {'workflow_id': u'basic-bcb8a2b0092072ef', 'version': u'example', 'name': u'basic', 'run_id': u'22u4Mb2YpWB4lVc4ZabGNkk/DSPnRMocS5OR7mOxdUg1Y=', 'parent_workflow_id': None, 'tag_list': [u'a=1', u'b=foo'], 'parent_run_id': None, 'domain_name': u'TestDomain', 'continued_execution_run_id': None} 
May 14 10:22:50 127.0.0.1 basic-bcb8a2b0092072ef: decision#21 INFO pid=3843 result of (1 + 1) * 2 = 4 
May 14 10:22:50 127.0.0.1 basic-bcb8a2b0092072ef: decision#21 INFO pid=3843 completing decision for workflow basic-bcb8a2b0092072ef (basic) 
May 14 10:22:51 127.0.0.1 logger: INFO pid=3823 process: caught signal signal=SIGTERM pid=3823 
May 14 10:22:51 127.0.0.1 logger: INFO pid=3823 process: will stop workers, this might take up several minutes. Please, be patient. 
May 14 10:22:51 127.0.0.1 logger: INFO pid=3823 process: sending SIGTERM to pid=3824 
May 14 10:22:51 127.0.0.1 logger: INFO pid=3823 process: waiting for proces=psutil.Process(pid=3824, name='Python') to finish. 
May 14 10:22:51 127.0.0.1 logger: INFO pid=3824 process: caught signal signal=SIGTERM pid=3824 
May 14 10:22:51 127.0.0.1 logger: INFO pid=3824 stopping ActivityPoller(task_list=basic-bcb8a2b0092072ef-ca7d812541f544fe964ded26ae5cc8db)

Remarks:

  • here logger as the default program name, and some logs for a basic-bcb8a2b0092072ef workflow
  • domain is supposed to be obvious ; in Botify case domain matches the "environment" and given a specific machine, we know the SWF domain it uses directly ; YMMV
  • we could log a handful of other things but logs become very noisy and hard to read by a human (task list, activity id, activity name, etc.) => better have it in a tool above that, and this can always be changed later by making the logging pattern more flexible
  • I didn't test how bare client logging.getLogger().info("foo") behave, especially in pypy forks when using simpleflow.execute
  • I tested it with logging.handlers.SysLogHandler which has the following limitations: silent errors when logging with TCP (or maybe papertrail's fault?), explicit errors on UDP when above the max packet size (~9k bytes on my machine, but depends on MTU and remote syslog server I guess) ; no TCP+TLS support
  • logging with syslog introduce another limitation, that "program name" is the "workflow id" in our case ; in syslog program name is limited to 64 chars, but workflow id in SWF is limited to 256 ; in practice we don't have much workflows over 80 chars at Botify and they're all unique when truncated to 52 chars, but YMMV
  • the implementation would rely on environment variables which are 1/ non-intrusive and 2/ easy to retrieve in subprocesses, optionally not from simpleflow: any "poll_for_(activity|decision)_task" call would set a few environment variables, that the logger or any other program could use

Thoughts?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions