Hive文本中的特殊字符处理

hive处理存储于json字段中的html文本,碰到特殊字符导致数据与字段错位。

\n 换行符,\u000A
\r 回车符,\u000D
\t tab制表符(移至下一列) ,\u0009

使用函数regexp_replace替换特殊字符

#测试数据
etl.origin_message_sms_log.line:
{"msg_content":",xx好,\u0009风控订单扣除订单号:6625561521205614259\u0009审核结果会以短信形式告知【xx】","msg_status":{"$numberLong":"2"}}

#SQL解析
select regexp_replace(get_json_object(line,'$.msg_content'),'\n|\t|\r','') msg_content
,get_json_object(regexp_replace(get_json_object(line,'$.msg_status'),'\\\\$',''),'$.numberLong')
from etl.origin_message_sms_log
Author: Tunan
Link: http://yerias.github.io/2021/06/14/hive/26/
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless stating additionally.