Thursday, May 28, 2020

Hive解析多重嵌套JSON数组

Hive解析多重嵌套JSON数组


最近在工作中遇到一个很难解析的JSON,他是一个嵌套的JSON数组的JSON,要使用Hive来进行解析,用Presto写了一次,逻辑就很清晰,因为Presto自带了JSON数据类型,转换数组就很方便,而Hive解析完JSON数组后是一个字符串,只能使用split方法来对string类型的数据进行切分,所以如果遇到多层嵌套的数组,要注意切分方法,不然就会乱套。

需要解析的JSON

{	"base": {		"code": "xm",		"name": "project"	},	"list": [{		"ACode": "cp1",		"AName": "Product1",		"BList": [{			"BCode": "gn1",			"BName": "Feature1"		}, {			"BCode": "gn2",			"BName": "Feature2"		}]	}, {		"ACode": "cp2",		"AName": "Product2",		"BList": [{			"BCode": "gn1",			"BName": "Feature1"		}]	}]}

解析出来的结果应该如下表所示

codenameACodeAnameBcodeBname
xmprojectcp1Product1gn1Feature1
xmprojectcp1Product1gn2Feature2
xmprojectcp2Product2gn1Feature1

解决方案

首先使用get_json_object方法,把需要解析的数组解析出来,然后使用regexp_replace}]},{替换成}]}||{,然后再使用split方法对||进行分割,分割成数组后,使用lateral view explode方法对其进行展开成多列即刻。

SELECT code , name , ai.ACode , ai.AName , bi.BCode , bi.BNameFROM (  SELECT   get_json_object(t.value, '$.base.code') AS code   , get_json_object(t.value, '$.base.name') AS name   , get_json_object(t.value, '$.list')  AS list  FROM   (    SELECT     '{"base":{"code":"xm","name":"project"},"list":[{"ACode":"cp1","AName":"Product1","BList":[{"BCode":"gn1","BName":"Feature1"},{"BCode":"gn2","BName":"Feature2"}]},{"ACode":"cp2","AName":"Product2","BList":[{"BCode":"gn1","BName":"Feature1"}]}]}' as value   )   t ) t lateral view explode(split(regexp_replace(regexp_extract(list,'^\\[(.+)\\]$',1),'\\}\\]\\}\\,\\{', '\\}\\]\\}\\|\\|\\{'),'\\|\\|')) list as a 	lateral view json_tuple(a,'ACode','AName','BList') ai as ACode , AName , BList lateral view explode(split(regexp_replace(regexp_extract(BList,'^\\[(.+)\\]$',1),'\\}\\,\\{', '\\}\\|\\|\\{'),'\\|\\|')) BList as b  lateral view json_tuple(b,'BCode','BName') bi as BCode , BName;

执行完

xm	project	cp1	Product1	gn1	Feature1xm	project	cp1	Product1	gn2	Feature2xm	project	cp2	Product2	gn1	Feature1Time taken: 0.787 seconds, Fetched: 3 row(s)

hive

总结

  1. 尝试切分为数组后,使用lateral view posexplode方案,逐层解析,但这样会导致笛卡尔。所以必须一次性全部解析好,而不是套用多个子查询逐层解析;
  2. 使用OUTER字段,能使LATERAL VIEW不忽略NULL

include OUTER in the query to get rows with NULL values

something like,

select * FROM table LATERAL VIEW OUTER explode ( split ( email ,',' ) ) email AS email_id;

No comments:

Post a Comment