Client
The Kafka emitter or Rest emitter can be used to push metadata to DataHub. The DataHub graph client extends the Rest emitter with additional functionality.
- class datahub.emitter.rest_emitter.DataHubRestEmitter(gms_server, token=None, timeout_sec=None, connect_timeout_sec=None, read_timeout_sec=None, retry_status_codes=None, retry_methods=None, retry_max_times=None, extra_headers=None, ca_certificate_path=None, client_certificate_path=None, disable_ssl_verification=False)
Bases:
Closeable
,Emitter
- Parameters:
gms_server (
str
)token (
Optional
[str
])timeout_sec (
Optional
[float
])connect_timeout_sec (
Optional
[float
])read_timeout_sec (
Optional
[float
])retry_status_codes (
Optional
[List
[int
]])retry_methods (
Optional
[List
[str
]])retry_max_times (
Optional
[int
])extra_headers (
Optional
[Dict
[str
,str
]])ca_certificate_path (
Optional
[str
])client_certificate_path (
Optional
[str
])disable_ssl_verification (
bool
)
- test_connection()
- Return type:
None
- get_server_config()
- Return type:
dict
- to_graph()
- Return type:
- emit(item, callback=None)
- Parameters:
item (
Union
[MetadataChangeEventClass
,MetadataChangeProposalClass
,MetadataChangeProposalWrapper
,UsageAggregationClass
])callback (
Optional
[Callable
[[Exception
,str
],None
]])
- Return type:
None
- emit_mce(mce)
- Parameters:
mce (
MetadataChangeEventClass
)- Return type:
None
- emit_mcp(mcp)
- Parameters:
mcp (
Union
[MetadataChangeProposalClass
,MetadataChangeProposalWrapper
])- Return type:
None
- emit_usage(usageStats)
- Parameters:
usageStats (
UsageAggregationClass
)- Return type:
None
- flush()
- Return type:
None
- close()
- Return type:
None
- datahub.emitter.rest_emitter.DatahubRestEmitter
alias of
DataHubRestEmitter
- class datahub.emitter.kafka_emitter.KafkaEmitterConfig(**data)
Bases:
ConfigModel
- Parameters:
data (
Any
)connection (KafkaProducerConnectionConfig)
topic_routes (Dict[str, str])
-
connection:
KafkaProducerConnectionConfig
-
topic_routes:
Dict
[str
,str
]
- classmethod validate_topic_routes(v)
- Parameters:
v (
Dict
[str
,str
])- Return type:
Dict
[str
,str
]
- model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {'_schema_extra': <function ConfigModel.Config._schema_extra>, 'extra': 'forbid', 'ignored_types': (<class 'cached_property.cached_property'>,), 'json_schema_extra': <function ConfigModel.Config._schema_extra>}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[dict[str, FieldInfo]] = {'connection': FieldInfo(annotation=KafkaProducerConnectionConfig, required=False, default_factory=KafkaProducerConnectionConfig), 'topic_routes': FieldInfo(annotation=Dict[str, str], required=False, default={'MetadataChangeEvent': 'MetadataChangeEvent_v4', 'MetadataChangeProposal': 'MetadataChangeProposal_v1'})}
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].
This replaces Model.__fields__ from Pydantic V1.
- class datahub.emitter.kafka_emitter.DatahubKafkaEmitter(config)
Bases:
Closeable
,Emitter
- Parameters:
config (
KafkaEmitterConfig
)
- emit(item, callback=None)
- Parameters:
item (
Union
[MetadataChangeEventClass
,MetadataChangeProposalClass
,MetadataChangeProposalWrapper
])callback (
Optional
[Callable
[[Exception
,str
],None
]])
- Return type:
None
- emit_mce_async(mce, callback)
- Parameters:
mce (
MetadataChangeEventClass
)callback (
Callable
[[Exception
,str
],None
])
- Return type:
None
- emit_mcp_async(mcp, callback)
- Parameters:
mcp (
Union
[MetadataChangeProposalClass
,MetadataChangeProposalWrapper
])callback (
Callable
[[Exception
,str
],None
])
- Return type:
None
- flush()
- Return type:
None
- close()
- Return type:
None
- class datahub.ingestion.graph.client.DatahubClientConfig(**data)
Bases:
ConfigModel
Configuration class for holding connectivity to datahub gms
- Parameters:
data (
Any
)server (str)
token (str | None)
timeout_sec (int | None)
retry_status_codes (List[int] | None)
retry_max_times (int | None)
extra_headers (Dict[str, str] | None)
ca_certificate_path (str | None)
client_certificate_path (str | None)
disable_ssl_verification (bool)
-
server:
str
-
token:
Optional
[str
]
-
timeout_sec:
Optional
[int
]
-
retry_status_codes:
Optional
[List
[int
]]
-
retry_max_times:
Optional
[int
]
-
extra_headers:
Optional
[Dict
[str
,str
]]
-
ca_certificate_path:
Optional
[str
]
-
client_certificate_path:
Optional
[str
]
-
disable_ssl_verification:
bool
- model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {'_schema_extra': <function ConfigModel.Config._schema_extra>, 'extra': 'forbid', 'ignored_types': (<class 'cached_property.cached_property'>,), 'json_schema_extra': <function ConfigModel.Config._schema_extra>}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[dict[str, FieldInfo]] = {'ca_certificate_path': FieldInfo(annotation=Union[str, NoneType], required=False, default=None), 'client_certificate_path': FieldInfo(annotation=Union[str, NoneType], required=False, default=None), 'disable_ssl_verification': FieldInfo(annotation=bool, required=False, default=False), 'extra_headers': FieldInfo(annotation=Union[Dict[str, str], NoneType], required=False, default=None), 'retry_max_times': FieldInfo(annotation=Union[int, NoneType], required=False, default=None), 'retry_status_codes': FieldInfo(annotation=Union[List[int], NoneType], required=False, default=None), 'server': FieldInfo(annotation=str, required=False, default='http://localhost:8080'), 'timeout_sec': FieldInfo(annotation=Union[int, NoneType], required=False, default=None), 'token': FieldInfo(annotation=Union[str, NoneType], required=False, default=None)}
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].
This replaces Model.__fields__ from Pydantic V1.
- datahub.ingestion.graph.client.DataHubGraphConfig
alias of
DatahubClientConfig
- class datahub.ingestion.graph.client.RelatedEntity(urn, relationship_type, via=None)
Bases:
object
- Parameters:
urn (
str
)relationship_type (
str
)via (
Optional
[str
])
-
urn:
str
-
relationship_type:
str
-
via:
Optional
[str
] = None
- class datahub.ingestion.graph.client.DataHubGraph(config)
Bases:
DataHubRestEmitter
- Parameters:
config (
DatahubClientConfig
)
- test_connection()
- Return type:
None
- classmethod from_emitter(emitter)
- Parameters:
emitter (
DataHubRestEmitter
)- Return type:
- make_rest_sink(run_id='__datahub-graph-client')
- Parameters:
run_id (
str
)- Return type:
Iterator
[DatahubRestSink]
- emit_all(items, run_id='__datahub-graph-client')
Emit all items in the iterable using multiple threads.
- Parameters:
items (
Iterable
[Union
[MetadataChangeEventClass
,MetadataChangeProposalClass
,MetadataChangeProposalWrapper
]])run_id (
str
)
- Return type:
None
- get_aspect(entity_urn, aspect_type, version=0)
Get an aspect for an entity.
- Parameters:
entity_urn (
str
) – The urn of the entityaspect_type (
Type
[TypeVar
(Aspect
, bound=_Aspect
)]) – The type class of the aspect being requested (e.g. datahub.metadata.schema_classes.DatasetProperties)version (
int
) – The version of the aspect to retrieve. The default of 0 means latest. Versions > 0 go from oldest to newest, so 1 is the oldest.
- Return type:
Optional
[TypeVar
(Aspect
, bound=_Aspect
)]- Returns:
the Aspect as a dictionary if present, None if no aspect was found (HTTP status 404)
- Raises:
TypeError – if the aspect type is a timeseries aspect
HttpError – if the HTTP response is not a 200 or a 404
- get_aspect_v2(entity_urn, aspect_type, aspect, aspect_type_name=None, version=0)
- Parameters:
entity_urn (
str
)aspect_type (
Type
[TypeVar
(Aspect
, bound=_Aspect
)])aspect (
str
)aspect_type_name (
Optional
[str
])version (
int
)
- Return type:
Optional
[TypeVar
(Aspect
, bound=_Aspect
)]
- get_config()
- Return type:
Dict
[str
,Any
]
- get_ownership(entity_urn)
- Parameters:
entity_urn (
str
)- Return type:
Optional
[OwnershipClass
]
- get_schema_metadata(entity_urn)
- Parameters:
entity_urn (
str
)- Return type:
Optional
[SchemaMetadataClass
]
- get_domain_properties(entity_urn)
- Parameters:
entity_urn (
str
)- Return type:
Optional
[DomainPropertiesClass
]
- get_dataset_properties(entity_urn)
- Parameters:
entity_urn (
str
)- Return type:
Optional
[DatasetPropertiesClass
]
- get_tags(entity_urn)
- Parameters:
entity_urn (
str
)- Return type:
Optional
[GlobalTagsClass
]
- get_glossary_terms(entity_urn)
- Parameters:
entity_urn (
str
)- Return type:
Optional
[GlossaryTermsClass
]
- get_domain(entity_urn)
- Parameters:
entity_urn (
str
)- Return type:
Optional
[DomainsClass
]
- get_browse_path(entity_urn)
- Parameters:
entity_urn (
str
)- Return type:
Optional
[BrowsePathsClass
]
- get_usage_aspects_from_urn(entity_urn, start_timestamp, end_timestamp)
- Parameters:
entity_urn (
str
)start_timestamp (
int
)end_timestamp (
int
)
- Return type:
Optional
[List
[DatasetUsageStatisticsClass
]]
- list_all_entity_urns(entity_type, start, count)
- Parameters:
entity_type (
str
)start (
int
)count (
int
)
- Return type:
Optional
[List
[str
]]
- get_latest_timeseries_value(entity_urn, aspect_type, filter_criteria_map)
- Parameters:
entity_urn (
str
)aspect_type (
Type
[TypeVar
(Aspect
, bound=_Aspect
)])filter_criteria_map (
Dict
[str
,str
])
- Return type:
Optional
[TypeVar
(Aspect
, bound=_Aspect
)]
- get_timeseries_values(entity_urn, aspect_type, filter, limit=10)
- Parameters:
entity_urn (
str
)aspect_type (
Type
[TypeVar
(Aspect
, bound=_Aspect
)])filter (
Dict
[str
,Any
])limit (
int
)
- Return type:
List
[TypeVar
(Aspect
, bound=_Aspect
)]
- get_entity_raw(entity_urn, aspects=None)
- Parameters:
entity_urn (
str
)aspects (
Optional
[List
[str
]])
- Return type:
Dict
- get_aspects_for_entity(entity_urn, aspects, aspect_types)
Get multiple aspects for an entity.
Deprecated in favor of get_aspect (single aspect) or get_entity_semityped (full entity without manually specifying a list of aspects).
Warning: Do not use this method to determine if an entity exists! This method will always return an entity, even if it doesn’t exist. This is an issue with how DataHub server responds to these calls, and will be fixed automatically when the server-side issue is fixed.
- Parameters:
entity_urn (
str
) – The urn of the entityaspect_type_list (List[Type[Aspect]]) – List of aspect type classes being requested (e.g. [datahub.metadata.schema_classes.DatasetProperties])
aspects_list (List[str]) – List of aspect names being requested (e.g. [schemaMetadata, datasetProperties])
entity_urn
aspects (
List
[str
])aspect_types (
List
[Type
[TypeVar
(Aspect
, bound=_Aspect
)]])
- Return type:
Dict
[str
,Optional
[TypeVar
(Aspect
, bound=_Aspect
)]]- Returns:
Optionally, a map of aspect_name to aspect_value as a dictionary if present, aspect_value will be set to None if that aspect was not found. Returns None on HTTP status 404.
- Raises:
HttpError – if the HTTP response is not a 200
- get_entity_semityped(entity_urn)
Get all non-timeseries aspects for an entity (experimental).
This method is called “semityped” because it returns aspects as a dictionary of properly typed objects. While the returned dictionary is constrained using a TypedDict, the return type is still fairly loose.
Warning: Do not use this method to determine if an entity exists! This method will always return something, even if the entity doesn’t actually exist in DataHub.
- Parameters:
entity_urn (
str
) – The urn of the entity- Return type:
- Returns:
A dictionary of aspect name to aspect value. If an aspect is not found, it will not be present in the dictionary. The entity’s key aspect will always be present.
- get_domain_urn_by_name(domain_name)
Retrieve a domain urn based on its name. Returns None if there is no match found
- Parameters:
domain_name (
str
)- Return type:
Optional
[str
]
- get_container_urns_by_filter(env=None, search_query='*')
Return container urns that match based on query
- Parameters:
env (
Optional
[str
])search_query (
str
)
- Return type:
Iterable
[str
]
- get_urns_by_filter(*, entity_types=None, platform=None, platform_instance=None, env=None, query=None, container=None, status=RemovedStatusFilter.NOT_SOFT_DELETED, batch_size=10000, extraFilters=None)
Fetch all urns that match all of the given filters.
Filters are combined conjunctively. If multiple filters are specified, the results will match all of them. Note that specifying a platform filter will automatically exclude all entity types that do not have a platform. The same goes for the env filter.
- Parameters:
entity_types (
Optional
[List
[str
]]) – List of entity types to include. If None, all entity types will be returned.platform (
Optional
[str
]) – Platform to filter on. If None, all platforms will be returned.platform_instance (
Optional
[str
]) – Platform instance to filter on. If None, all platform instances will be returned.env (
Optional
[str
]) – Environment (e.g. PROD, DEV) to filter on. If None, all environments will be returned.query (
Optional
[str
]) – Query string to filter on. If None, all entities will be returned.container (
Optional
[str
]) – A container urn that entities must be within. This works recursively, so it will include entities within sub-containers as well. If None, all entities will be returned. Note that this requires browsePathV2 aspects (added in 0.10.4+).status (
RemovedStatusFilter
) – Filter on the deletion status of the entity. The default is only return non-soft-deleted entities.extraFilters (
Optional
[List
[Dict
[str
,Any
]]]) – Additional filters to apply. If specified, the results will match all of the filters.batch_size (
int
)
- Return type:
Iterable
[str
]- Returns:
An iterable of urns that match the filters.
- get_latest_pipeline_checkpoint(pipeline_name, platform)
- Parameters:
pipeline_name (
str
)platform (
str
)
- Return type:
Optional
[Checkpoint
[GenericCheckpointState]]
- get_search_results(start=0, count=1, entity='dataset')
- Parameters:
start (
int
)count (
int
)entity (
str
)
- Return type:
Dict
- get_aspect_counts(aspect, urn_like=None)
- Parameters:
aspect (
str
)urn_like (
Optional
[str
])
- Return type:
int
- execute_graphql(query, variables=None, operation_name=None)
- Parameters:
query (
str
)variables (
Optional
[Dict
])operation_name (
Optional
[str
])
- Return type:
Dict
- class RelationshipDirection(value)
Bases:
str
,Enum
An enumeration.
- INCOMING = 'INCOMING'
- OUTGOING = 'OUTGOING'
- Parameters:
entity_urn (
str
)relationship_types (
List
[str
])direction (
RelationshipDirection
)
- Return type:
Iterable
[RelatedEntity
]
- exists(entity_urn)
- Parameters:
entity_urn (
str
)- Return type:
bool
- soft_delete_entity(urn, run_id='__datahub-graph-client', deletion_timestamp=None)
Soft-delete an entity by urn.
- Parameters:
urn (
str
) – The urn of the entity to soft-delete.run_id (
str
)deletion_timestamp (
Optional
[int
])
- Return type:
None
- hard_delete_entity(urn)
Hard delete an entity by urn.
- Parameters:
urn (
str
) – The urn of the entity to hard delete.- Return type:
Tuple
[int
,int
]- Returns:
A tuple of (rows_affected, timeseries_rows_affected).
- delete_entity(urn, hard=False)
Delete an entity by urn.
- Parameters:
urn (
str
) – The urn of the entity to delete.hard (
bool
) – Whether to hard delete the entity. If False (default), the entity will be soft deleted.
- Return type:
None
- hard_delete_timeseries_aspect(urn, aspect_name, start_time, end_time)
Hard delete timeseries aspects of an entity.
- Parameters:
urn (
str
) – The urn of the entity.aspect_name (
str
) – The name of the timeseries aspect to delete.start_time (
Optional
[datetime
]) – The start time of the timeseries data to delete. If not specified, defaults to the beginning of time.end_time (
Optional
[datetime
]) – The end time of the timeseries data to delete. If not specified, defaults to the end of time.
- Return type:
int
- Returns:
The number of timeseries rows affected.
- delete_references_to_urn(urn, dry_run=False)
Delete references to a given entity.
This is useful for cleaning up references to an entity that is about to be deleted. For example, when deleting a tag, you might use this to remove that tag from all other entities that reference it.
This does not delete the entity itself.
- Parameters:
urn (
str
) – The urn of the entity to delete references to.dry_run (
bool
) – If True, do not actually delete the references, just return the count of references and the list of related aspects.
- Return type:
Tuple
[int
,List
[Dict
]]- Returns:
A tuple of (reference_count, sample of related_aspects).
- initialize_schema_resolver_from_datahub(platform, platform_instance, env, batch_size=100)
- Parameters:
platform (
str
)platform_instance (
Optional
[str
])env (
str
)batch_size (
int
)
- Return type:
SchemaResolver
- parse_sql_lineage(sql, *, platform, platform_instance=None, env='PROD', default_db=None, default_schema=None)
- Parameters:
sql (
str
)platform (
str
)platform_instance (
Optional
[str
])env (
str
)default_db (
Optional
[str
])default_schema (
Optional
[str
])
- Return type:
SqlParsingResult
- create_tag(tag_name)
- Parameters:
tag_name (
str
)- Return type:
str
- close()
- Return type:
None
- datahub.ingestion.graph.client.get_default_graph()
- Return type:
- class datahub.ingestion.graph.client.DatahubConfig(**data)
Bases:
BaseModel
- Parameters:
data (
Any
)gms (DatahubClientConfig)
-
gms:
DatahubClientConfig
- model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[dict[str, FieldInfo]] = {'gms': FieldInfo(annotation=DatahubClientConfig, required=True)}
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].
This replaces Model.__fields__ from Pydantic V1.
- datahub.ingestion.graph.client.set_env_variables_override_config(url, token)
Should be used to override the config when using rest emitter
- Parameters:
url (
str
)token (
Optional
[str
])
- Return type:
None
- datahub.ingestion.graph.client.get_details_from_env()
- Return type:
Tuple
[Optional
[str
],Optional
[str
]]
- datahub.ingestion.graph.client.load_client_config()
- Return type:
- datahub.ingestion.graph.client.ensure_datahub_config()
- Return type:
None
- datahub.ingestion.graph.client.write_gms_config(host, token, merge_with_previous=True)
- Parameters:
host (
str
)token (
Optional
[str
])merge_with_previous (
bool
)
- Return type:
None