A comma inside a data field is a common scenario when dealing with flat files such as CSV. Before loading, most projects cleanse the data by removing the comma. But what if it is mandatory to retain that comma in the data?

Simple solution: Regular Expression

Data

536381,82567,"AIRLINE LOUNGE,METAL SIGN",2,12/1/2010 9:41,2.1,15311,United Kingdom

REGEX JAVA

,(?=([^\"]*\"[^\"]*\")*(?![^\"]*\"))

The lookahead matches a comma only when it is followed by an even number of double quotes up to the end of the line, i.e. only when the comma sits outside a quoted field (this assumes quotes are balanced and not escaped).

RESULT

536381,82567,"AIRLINE LOUNGE,METAL SIGN",2,12/1/2010 9:41,2.1,15311,United Kingdom
[0] = 536381
[1] = 82567
[2] = AIRLINE LOUNGE,METAL SIGN
[3] = 2
[4] = 12/1/2010 9:41
[5] = 2.1
[6] = 15311
[7] = United Kingdom
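The result above can be reproduced with a short standalone program (the class name and the quote-stripping step are illustrative, not part of the original post; `split` itself keeps the surrounding quotes on field [2]):

```java
import java.util.regex.Pattern;

public class CsvSplitDemo {
    public static void main(String[] args) {
        // Split on commas that are outside double-quoted fields
        Pattern p = Pattern.compile(",(?=([^\"]*\"[^\"]*\")*(?![^\"]*\"))");
        String line = "536381,82567,\"AIRLINE LOUNGE,METAL SIGN\",2,"
                + "12/1/2010 9:41,2.1,15311,United Kingdom";
        String[] words = p.split(line);
        for (int i = 0; i < words.length; i++) {
            // Strip the surrounding quotes so the output matches the listing above
            System.out.println("[" + i + "] = " + words[i].replace("\"", ""));
        }
    }
}
```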

JAVA MR

import java.util.regex.Pattern;

// Compile the pattern once (e.g. in setup()) and reuse it for every record
Pattern p = Pattern.compile(",(?=([^\"]*\"[^\"]*\")*(?![^\"]*\"))");
String line = value.toString();   // value is the mapper's Text input
String[] words = p.split(line);

Spark Scala

data_raw.map { line =>
  val eachLine = line.split(",(?=([^\"]*\"[^\"]*\")*(?![^\"]*\"))")
  val description = eachLine(2)   // "AIRLINE LOUNGE,METAL SIGN" per the result above
  description
}
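One caveat worth noting, in case the input contains empty trailing fields (an assumption; the sample line here is hypothetical): Java's `split` — which Scala's `String.split` delegates to — drops trailing empty strings by default, so the token count can come out shorter than the column count. Passing a negative limit keeps them:

```java
import java.util.regex.Pattern;

public class TrailingFieldsDemo {
    public static void main(String[] args) {
        Pattern p = Pattern.compile(",(?=([^\"]*\"[^\"]*\")*(?![^\"]*\"))");
        String line = "536381,82567,,";   // two empty trailing fields

        // Default limit (0) discards trailing empty strings
        System.out.println(p.split(line).length);      // prints 2
        // Negative limit keeps every field, empty or not
        System.out.println(p.split(line, -1).length);  // prints 4
    }
}
```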
