Solving Garbled Text When Reading Files with Java Spark
I. The Problem
Environment: JDK 1.8, Spark 3.2.1. Reading a GB18030-encoded file from Hadoop produces garbled text.
II. Painful Attempts
I tried many approaches to solve this, and none of them worked.
1. textFile + Configuration — garbled
String filePath = "hdfs:///user/test.deflate";
String encoding = "GB18030";

// Create the SparkSession and JavaSparkContext
SparkSession spark = SparkSession.builder()
        .master("local[*]").appName("Spark Example")
        .getOrCreate();
JavaSparkContext sc = JavaSparkContext.fromSparkContext(spark.sparkContext());

Configuration entries = sc.hadoopConfiguration();
entries.set("textinputformat.record.delimiter", "\n");
entries.set("mapreduce.input.fileinputformat.inputdir", filePath);
entries.set("mapreduce.input.fileinputformat.encoding", "GB18030");

JavaRDD<String> rdd = sc.textFile(filePath);
2. spark.read().option — garbled
Dataset<Row> load = spark.read().format("text")
        .option("encoding", "GB18030").load(filePath);
load.foreach(row -> {
    System.out.println(row.toString());
    // Re-encoding an already-garbled String cannot recover the original bytes
    System.out.println(new String(row.toString().getBytes(encoding), "UTF-8"));
    System.out.println(new String(row.toString().getBytes(encoding), "GBK"));
});
3. newAPIHadoopFile + Configuration — garbled
JavaPairRDD<LongWritable, Text> longWritableTextJavaPairRDD = sc.newAPIHadoopFile(
        filePath, TextInputFormat.class, LongWritable.class, Text.class, entries);
System.out.println("longWritableTextJavaPairRDD count = " + longWritableTextJavaPairRDD.count());
longWritableTextJavaPairRDD.foreach(k -> {
    System.out.println(k._2);
});
4. newAPIHadoopFile + custom InputFormat — garbled
JavaPairRDD<LongWritable, Text> longWritableTextJavaPairRDD = sc.newAPIHadoopFile(
        filePath, GBKInputFormat.class, LongWritable.class, Text.class, entries);
System.out.println("longWritableTextJavaPairRDD count = " + longWritableTextJavaPairRDD.count());
longWritableTextJavaPairRDD.foreach(k -> {
    System.out.println(k._2);
});
The GBKInputFormat.class used here is a copy of TextInputFormat.class with the internal UTF-8 charset changed to GB18030.
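The essence of such a custom InputFormat is swapping the charset used to decode each record. The Hadoop plumbing aside, the change can be sketched with plain JDK classes (the byte stream below is a hypothetical stand-in for a file split):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class Gb18030LineReader {
    // Decode lines from a byte stream using GB18030 instead of UTF-8,
    // which is what a GB18030-aware record reader must do.
    public static String firstLine(InputStream in) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, Charset.forName("GB18030")));
        return reader.readLine();
    }

    public static void main(String[] args) throws IOException {
        // Simulate a GB18030-encoded file on disk
        byte[] fileBytes = "中文测试\n第二行\n".getBytes("GB18030");
        // Decoding with the matching charset recovers the original text
        System.out.println(firstLine(new ByteArrayInputStream(fileBytes)));
        // prints "中文测试"
    }
}
```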
5. newAPIHadoopRDD + custom InputFormat — garbled
JavaPairRDD<LongWritable, Text> longWritableTextJavaPairRDD1 = sc.newAPIHadoopRDD(
        entries, GBKInputFormat.class, LongWritable.class, Text.class);
System.out.println("longWritableTextJavaPairRDD1 count = " + longWritableTextJavaPairRDD1.count());
longWritableTextJavaPairRDD1.foreach(k -> {
    System.out.println(k._2());
});
III. Final Solution
As far as I can tell, the charset specified in the approaches above simply never took effect. My best guess is that Hadoop's Text type always treats its bytes as UTF-8, so every toString() call decodes the GB18030 bytes as UTF-8 regardless of configuration; if anyone knows the exact cause, please enlighten me. The final working solution is as follows:
JavaPairRDD<LongWritable, Text> longWritableTextJavaPairRDD = sc.newAPIHadoopFile(
        filePath, TextInputFormat.class, LongWritable.class, Text.class, new Configuration());
System.out.println("longWritableTextJavaPairRDD count = " + longWritableTextJavaPairRDD.count());
longWritableTextJavaPairRDD.foreach(k -> {
    System.out.println(new String(k._2.copyBytes(), encoding));
});
JavaPairRDD<LongWritable, Text> longWritableTextJavaPairRDD1 = sc.newAPIHadoopRDD(
        entries, TextInputFormat.class, LongWritable.class, Text.class);
System.out.println("longWritableTextJavaPairRDD1 count = " + longWritableTextJavaPairRDD1.count());
longWritableTextJavaPairRDD1.foreach(k -> {
    // k._2() and k._2 are equivalent accessors for the Text value
    System.out.println(new String(k._2().copyBytes(), encoding));
});
The key is new String(k._2().copyBytes(), encoding): copyBytes() hands back the raw file bytes, which can then be decoded with the correct charset.
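Why this fixes the problem can be reproduced without Hadoop at all. If I understand it correctly, Text.toString() decodes the stored bytes as UTF-8, which mangles GB18030 data irreversibly, while copyBytes() returns the untouched bytes so they can be decoded with the right charset. A minimal JDK-only sketch (the byte array stands in for what a Text record carries):

```java
import java.nio.charset.StandardCharsets;

public class EncodingFix {
    public static void main(String[] args) throws Exception {
        String original = "乱码测试";
        byte[] raw = original.getBytes("GB18030");   // bytes as stored on HDFS

        // What Text.toString() effectively does: decode the raw bytes as UTF-8.
        // GB18030 byte sequences are not valid UTF-8, so replacement characters
        // appear and the original bytes cannot be recovered from this String.
        String garbled = new String(raw, StandardCharsets.UTF_8);

        // The fix: keep the raw bytes (via copyBytes()) and decode them correctly.
        String fixed = new String(raw, "GB18030");

        System.out.println(garbled.equals(original)); // prints "false"
        System.out.println(fixed.equals(original));   // prints "true"
    }
}
```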