It is a common scenario while reading data from Kafka that you receive an array, or an array of JSON values, enclosed in a string. At first it may seem intimidating to convert this into an actual array, but this micro article will help you solve the problem.

Scenario A: You are receiving the data as a well-formed JSON string, e.g. "[\"a\",\"b\",\"c\"]"

Let's load some sample data to understand:

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

// The JSON array arrives wrapped in a string, so the column is a plain string
val df = sc.parallelize(Seq("[\"a\",\"b\",\"c\"]")).toDF("arr_str")

The schema of this DataFrame shows that the data is actually a string and not an array.
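A quick sanity check confirms this (output roughly as spark-shell would print it):

df.printSchema()
// root
//  |-- arr_str: string (nullable = true)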

To transform it into an array, use the from_json function, which is already available in Spark. It takes two arguments: a column and a data type. The column is the one you would like to parse, and the type should be ArrayType, as we need to parse the data as an array. ArrayType in turn takes an argument specifying the element type of the array; in this case, StringType.

// Parse the JSON string into an actual array of strings
val df2 = df.withColumn("arr", from_json(col("arr_str"), ArrayType(StringType)))
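Printing the schema again should now show the new arr column as an array of strings (a sketch of what you'd expect to see in spark-shell):

df2.printSchema()
// root
//  |-- arr_str: string (nullable = true)
//  |-- arr: array (nullable = true)
//  |    |-- element: string (containsNull = true)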

Scenario B: You are receiving the data as a string that is not well-formed JSON, e.g. "[a,b,c]"

Loading the sample data:

import org.apache.spark.sql.functions._

// The element values are not quoted, so this string is not valid JSON
val df = sc.parallelize(Seq("[a,b,c]")).toDF("arr_str")

As the string is not well formed, we cannot use from_json to parse it directly into an Array[String], so we will use an alternate approach.

We shall replace the '[' and ']' characters with an empty string and then split the data by ",".

// "\\[" and "\\]" escape the brackets inside the regex character class
val df2 = df.withColumn("arr", split(regexp_replace(col("arr_str"), "[\\[\\]]", ""), ","))
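A quick check should show the parsed array (expected output sketched below; the exact rendering can vary slightly between Spark versions):

df2.select("arr").show(false)
// +---------+
// |arr      |
// +---------+
// |[a, b, c]|
// +---------+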
