feat(ingest): add structured log type #14229

Open
wants to merge 9 commits into base: master

anshbansal
Collaborator

The idea is that our log messages are sometimes not helpful enough for users to self-serve. Adding a type to a message tells the reader which area it affects, for example LINEAGE. We can decide on more types in the future and gradually add log_type to our logging, so that it becomes easier to understand the impact of a given log without outside help. Which messages get a type should be driven by where people are unable to self-serve using our current log messages.

For now I have only added it to the mock data source, to demonstrate it, and to Fivetran, where I recently ran into this problem and so know that it affects lineage.

It shows up in the logs like this:

datahub ingest -c ../tmp/datahub_mock_data.dhub.yaml
[2025-07-25 18:31:16,384] INFO     {datahub.cli.ingest_cli:151} - DataHub CLI version: unavailable (installed in develop mode)
[2025-07-25 18:31:16,410] INFO     {datahub.ingestion.run.pipeline:225} - Sink configured successfully. 
[2025-07-25 18:31:16,503] INFO     {datahub.ingestion.run.pipeline:254} - Source configured successfully.
[2025-07-25 18:31:16,503] INFO     {datahub.cli.ingest_cli:132} - Starting metadata ingestion
[2025-07-25 18:31:18,046] ERROR    {datahub.ingestion.source.mock_data.datahub_mock_data:167} - Test Error: This is test error message => This is test error 0
[2025-07-25 18:31:18,047] WARNING  {datahub.ingestion.source.mock_data.datahub_mock_data:175} - Test Warning: This is test warning => This is test warning 0
[2025-07-25 18:31:18,050] INFO     {datahub.cli.ingest_cli:145} - Finished metadata ingestion
Cli report:
{'cli_version': 'unavailable (installed in develop mode)',
 'cli_entry_location': '/Users/aseembansal/code/datahub/metadata-ingestion/src/datahub/ingestion/run/pipeline.py',
 'models_version': 'bundled',
 'py_version': '3.10.16 (main, Jan 18 2025, 09:48:57) [Clang 16.0.0 (clang-1600.0.26.6)]',
 'py_exec_path': '/Users/aseembansal/code/datahub/metadata-ingestion/venv/bin/python3',
 'os_details': 'macOS-15.5-arm64-arm-64bit',
 'mem_info': '122.62 MB',
 'peak_memory_usage': '122.62 MB',
 'disk_info': {'total': '994.66 GB', 'used': '325.72 GB', 'used_initally': '325.72 GB', 'free': '668.94 GB'},
 'peak_disk_usage': '325.72 GB',
 'thread_count': 4,
 'peak_thread_count': 4}
Source (datahub-mock-data) report:
{'aspects': {},
 'samples': {},
 'event_not_produced_warn': True,
 'events_produced': 0,
 'events_produced_per_sec': 0,
 'start_time': '2025-07-25 18:31:16.503604 (1.92 seconds ago)',
 'running_time': '1.92 seconds',
 'failures': [{'title': 'Test Error', 'message': 'This is test error message', 'context': ['This is test error 0']}],
 'warnings': [{'title': 'Test Warning', 'message': 'This is test warning', 'context': ['This is test warning 0'], 'log_type': 'LINEAGE'}],
 'infos': []}
Sink (console) report:
{'total_records_written': 0,
 'records_written_per_second': 0,
 'warnings': [],
 'failures': [],
 'start_time': '2025-07-25 18:31:16.410185 (2.01 seconds ago)',
 'current_time': '2025-07-25 18:31:18.424909 (now)',
 'total_duration_in_seconds': 2.01}

Pipeline finished with at least 1 failures; produced 0 events in 1.92 seconds.
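
The report above shows the intended shape: the warning entry carries 'log_type': 'LINEAGE', while the failure entry has no such key. As a rough illustration of how an optional type on a structured log entry could produce that shape, here is a minimal sketch. All names in it (LogType, LogEntry, SketchReport, entries_of_type) are hypothetical and are not taken from the DataHub codebase or from this PR's diff; only the LINEAGE value and the entry fields come from the output above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class LogType(str, Enum):
    """Hypothetical categories; this PR only mentions LINEAGE so far."""

    LINEAGE = "LINEAGE"


@dataclass
class LogEntry:
    """One structured log entry, mirroring the fields seen in the report."""

    title: str
    message: str
    context: List[str] = field(default_factory=list)
    log_type: Optional[LogType] = None  # optional, so untyped entries stay valid

    def as_obj(self) -> Dict[str, object]:
        obj: Dict[str, object] = {
            "title": self.title,
            "message": self.message,
            "context": list(self.context),
        }
        # Emit the key only when a type was given, so untyped entries
        # keep the same shape they had before.
        if self.log_type is not None:
            obj["log_type"] = self.log_type.value
        return obj


@dataclass
class SketchReport:
    """Hypothetical stand-in for a source report that collects entries."""

    warnings: List[LogEntry] = field(default_factory=list)
    failures: List[LogEntry] = field(default_factory=list)

    def warning(
        self,
        title: str,
        message: str,
        context: str,
        log_type: Optional[LogType] = None,
    ) -> None:
        self.warnings.append(LogEntry(title, message, [context], log_type))

    def failure(
        self,
        title: str,
        message: str,
        context: str,
        log_type: Optional[LogType] = None,
    ) -> None:
        self.failures.append(LogEntry(title, message, [context], log_type))

    def entries_of_type(self, log_type: LogType) -> List[LogEntry]:
        """Return every warning or failure tagged with the given type."""
        return [e for e in self.warnings + self.failures if e.log_type == log_type]


# Reproduce the two entries from the mock data source run above.
report = SketchReport()
report.failure("Test Error", "This is test error message", "This is test error 0")
report.warning(
    "Test Warning",
    "This is test warning",
    "This is test warning 0",
    log_type=LogType.LINEAGE,
)

# A user trying to self-serve could then pull out only lineage-related entries.
lineage_entries = [e.as_obj() for e in report.entries_of_type(LogType.LINEAGE)]
print(lineage_entries)
```

Keeping the field optional means existing call sites need no changes, which fits the plan of adding types gradually where users struggle to self-serve.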

github-actions bot added the ingestion label (PR or Issue related to the ingestion of metadata) on Jul 25, 2025

codecov bot commented Jul 25, 2025

❌ 2 Tests Failed:

Tests completed: 3415 | Failed: 2 | Passed: 3413 | Skipped: 45
View the full list of 2 ❄️ flaky tests
tests.lineage.test_lineage_sdk::test_column_level_lineage_from_schema_field

Flake rate in main: 10.06% (Passed 143 times, Failed 16 times)

Stack Traces | 0.021s run time
test_client = <datahub.sdk.main_client.DataHubClient object at 0x7f680b012590>
test_datasets = {'downstream1': Dataset('urn:li:dataset:(urn:li:dataPlatform:snowflake,test_lineage_downstream_001,PROD)'), 'downstrea...ream_003,PROD)'), 'upstream': Dataset('urn:li:dataset:(urn:li:dataPlatform:snowflake,test_lineage_upstream_001,PROD)')}

    def test_column_level_lineage_from_schema_field(
        test_client: DataHubClient, test_datasets: Dict[str, Dataset]
    ):
        source_schema_field = SchemaFieldUrn(test_datasets["upstream"].urn, "id")
        column_lineage_results = test_client.lineage.get_lineage(
            source_urn=str(source_schema_field), direction="downstream", max_hops=3
        )
    
>       assert len(column_lineage_results) == 3
E       assert 0 == 3
E        +  where 0 = len([])

tests/lineage/test_lineage_sdk.py:203: AssertionError
tests.lineage.test_lineage_sdk::test_column_level_lineage

Flake rate in main: 11.95% (Passed 140 times, Failed 19 times)

Stack Traces | 10.3s run time
test_client = <datahub.sdk.main_client.DataHubClient object at 0x7f680b012590>
test_datasets = {'downstream1': Dataset('urn:li:dataset:(urn:li:dataPlatform:snowflake,test_lineage_downstream_001,PROD)'), 'downstrea...ream_003,PROD)'), 'upstream': Dataset('urn:li:dataset:(urn:li:dataPlatform:snowflake,test_lineage_upstream_001,PROD)')}

    def test_column_level_lineage(
        test_client: DataHubClient, test_datasets: Dict[str, Dataset]
    ):
        column_lineage_results = test_client.lineage.get_lineage(
            source_urn=str(test_datasets["upstream"].urn),
            source_column="id",
            direction="downstream",
            max_hops=3,
        )
    
>       assert len(column_lineage_results) == 3
E       assert 0 == 3
E        +  where 0 = len([])

tests/lineage/test_lineage_sdk.py:152: AssertionError

datahub-cyborg bot added the needs-review label (Label for PRs that need review from a maintainer.) on Jul 25, 2025
github-actions bot requested a deployment to datahub-wheels (Preview) on July 25, 2025 13:11 (Abandoned)
github-actions bot requested a deployment to datahub-wheels (Preview) on July 25, 2025 13:15 (Abandoned)
datahub-cyborg bot added the pending-submitter-response label (Issue/request has been reviewed but requires a response from the submitter) and removed the needs-review label (Label for PRs that need review from a maintainer.) on Jul 25, 2025
Labels: ingestion (PR or Issue related to the ingestion of metadata), pending-submitter-response (Issue/request has been reviewed but requires a response from the submitter)
3 participants