
[Bug] [LocalFile] LocalFile source duplicates data (3x) when Spark runs in local mode #6868

Open
3 tasks done
AdkinsHan opened this issue May 17, 2024 · 0 comments
Search before asking

  • I had searched in the issues and found no similar issues.

What happened

When I used Spark local mode to read a local CSV file into a Hive table, every row was written 3 times; this did not happen when I ran Spark in yarn-cluster mode. I used SeaTunnel 1.5 before, where the migration process ran in local mode without issue, but when I tested version 2.3.5 the data was duplicated.
Summary:
--master local --deploy-mode client → 3 times
--master yarn --deploy-mode client → 3 times
--master yarn --deploy-mode cluster → correct
My CSV file has 2076 rows, but `select count(1) from xx` returns 3*2076.
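A duplication like this can be quantified independently of Hive by dumping the sink output to a plain CSV and counting per-row multiplicities. A minimal sketch (the helper name and the synthetic demo data are mine, not part of SeaTunnel):

```python
import csv
import os
import tempfile
from collections import Counter

def row_multiplicities(csv_path, skip_header_rows=1):
    """Return the set of multiplicities of data rows in the file.
    If the result is {3}, every row appears exactly 3 times,
    i.e. the job wrote the data three times over."""
    with open(csv_path, newline="") as f:
        rows = list(csv.reader(f))[skip_header_rows:]
    return set(Counter(map(tuple, rows)).values())

# Synthetic demo: two distinct rows, each written three times,
# mimicking the 3x duplication described above.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as f:
    f.write("sku,sku_group\n")
    f.write("a,1\na,1\na,1\nb,2\nb,2\nb,2\n")
    demo = f.name

print(row_multiplicities(demo))  # {3}
os.remove(demo)
```

A uniform result such as {3} points at the whole dataset being re-read or re-written, rather than a handful of rows being retried.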

SeaTunnel Version

2.3.5

SeaTunnel Config

env {
  # number of parallel tasks for the job
  execution.parallelism = 4
  job.mode = "BATCH"
  spark.executor.instances = 4
  spark.executor.cores = 4
  spark.executor.memory = "4g"
  spark.sql.catalogImplementation = "hive"
  spark.hadoop.hive.exec.dynamic.partition = "true"
  spark.hadoop.hive.exec.dynamic.partition.mode = "nonstrict"
}

source {
  LocalFile {
    schema {
      fields {
        sku = string
        sku_group = string
        pb = string
        series = string
        pn = string
        mater_n = string
      }
    }
    path = "/data/ghyworkbase/uploadfile/h019-ods_file_pjp_old_new_sku_yy.csv"
    file_format_type = "csv"
    skip_header_row_number = 1
    result_table_name = "ods_file_pjp_old_new_sku_yy_source"
  }
}

transform {
  Sql {
    source_table_name = "ods_file_pjp_old_new_sku_yy_source"
    query = "select sku,sku_group,pb,series,pn,mater_n,TO_CHAR(CURRENT_DATE(),'yyyy') as dt_year from ods_file_pjp_old_new_sku_yy_source"
    result_table_name = "ods_file_pjp_old_new_sku_yy"
  }
}

sink {

#   Console {
#      source_table_name = "ods_file_pjp_old_new_sku_yy"
#    }

   Hive {
     source_table_name="ods_file_pjp_old_new_sku_yy"
     table_name = "ghydata.ods_file_pjp_old_new_sku_yy"
     metastore_uri = "thrift://"
   }

}
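If the duplication is related to how file splits are assigned to parallel readers in local mode, forcing a single task is a quick diagnostic. A sketch only, not a confirmed fix; the key is the same `execution.parallelism` already set in the env block above:

```
env {
  # diagnostic: run the whole pipeline with one parallel task
  # and re-check the Hive row count against the CSV
  execution.parallelism = 1
  job.mode = "BATCH"
}
```

If the count comes out correct at parallelism 1 but wrong at 4, that narrows the bug to split assignment rather than the Hive sink.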

Running Command

sh /data/seatunnel/seatunnel-2.3.4/bin/start-seatunnel-spark-3-connector-v2.sh \
  --master local \
  --deploy-mode client \
  --queue ghydl \
  --executor-instances 4 \
  --executor-cores 4 \
  --executor-memory 4g \
  --name "h019-ods_file_pjp_old_new_sku_yy" \
  --config /2.3.5/h019-ods_file_pjp_old_new_sku_yy.conf

Error Exception

No exception is thrown; the only symptom is that the data is written 3 times.

Zeta or Flink or Spark Version

No response

Java or Scala Version

/usr/local/jdk/jdk1.8.0_341

Screenshots

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@AdkinsHan AdkinsHan added the bug label May 17, 2024