
[Bug] [LocalFile] LocalFile source duplicates data (3x) when Spark runs in local mode #6868

Open
3 tasks done
AdkinsHan opened this issue May 17, 2024 · 0 comments
Search before asking

  • I had searched in the issues and found no similar issues.

What happened

When I used Spark local mode to read a local CSV file into a Hive table, every row was written 3 times; this did not happen when I ran Spark in yarn-cluster mode. I used SeaTunnel 1.5 before, where the migration process ran in local mode without issue, but when I tested version 2.3.5 the data was duplicated.
Summary:
--master local --deploy-mode client → 3 times
--master yarn --deploy-mode client → 3 times
--master yarn --deploy-mode cluster → correct
My CSV file has 2076 rows, but `select count(1) from xx` returns 3*2076.
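A duplication like this can be quantified independently of Hive by dumping the sink output to a plain CSV and counting per-row multiplicities. A minimal sketch (the helper name and the synthetic demo data are mine, not part of SeaTunnel):

```python
import csv
import os
import tempfile
from collections import Counter

def row_multiplicities(csv_path, skip_header_rows=1):
    """Return the set of multiplicities of data rows in the file.
    If the result is {3}, every row appears exactly 3 times,
    i.e. the job wrote the data three times over."""
    with open(csv_path, newline="") as f:
        rows = list(csv.reader(f))[skip_header_rows:]
    return set(Counter(map(tuple, rows)).values())

# Synthetic demo: two distinct rows, each written three times,
# mimicking the 3x duplication described above.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as f:
    f.write("sku,sku_group\n")
    f.write("a,1\na,1\na,1\nb,2\nb,2\nb,2\n")
    demo = f.name

print(row_multiplicities(demo))  # {3}
os.remove(demo)
```

A uniform result such as {3} points at the whole dataset being re-read or re-written, rather than a handful of rows being retried.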

SeaTunnel Version

2.3.5

SeaTunnel Config

env {
  # number of parallel tasks for the job
  execution.parallelism = 4
  job.mode = "BATCH"
  spark.executor.instances = 4
  spark.executor.cores = 4
  spark.executor.memory = "4g"
  spark.sql.catalogImplementation = "hive"
  spark.hadoop.hive.exec.dynamic.partition = "true"
  spark.hadoop.hive.exec.dynamic.partition.mode = "nonstrict"
}

source {
  LocalFile {
    schema {
      fields {
        sku = string
        sku_group = string
        pb = string
        series = string
        pn = string
        mater_n = string
      }
    }
    path = "/data/ghyworkbase/uploadfile/h019-ods_file_pjp_old_new_sku_yy.csv"
    file_format_type = "csv"
    skip_header_row_number = 1
    result_table_name = "ods_file_pjp_old_new_sku_yy_source"
  }
}

transform {
  Sql {
    source_table_name = "ods_file_pjp_old_new_sku_yy_source"
    query = "select sku,sku_group,pb,series,pn,mater_n,TO_CHAR(CURRENT_DATE(),'yyyy') as dt_year from ods_file_pjp_old_new_sku_yy_source"
    result_table_name = "ods_file_pjp_old_new_sku_yy"
  }
}

sink {

#   Console {
#      source_table_name = "ods_file_pjp_old_new_sku_yy"
#    }

   Hive {
     source_table_name="ods_file_pjp_old_new_sku_yy"
     table_name = "ghydata.ods_file_pjp_old_new_sku_yy"
     metastore_uri = "thrift://"
   }

}
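If the duplication is related to how file splits are assigned to parallel readers in local mode, forcing a single task is a quick diagnostic. A sketch only, not a confirmed fix; the key is the same `execution.parallelism` already set in the env block above:

```
env {
  # diagnostic: run the whole pipeline with one parallel task
  # and re-check the Hive row count against the CSV
  execution.parallelism = 1
  job.mode = "BATCH"
}
```

If the count comes out correct at parallelism 1 but wrong at 4, that narrows the bug to split assignment rather than the Hive sink.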

Running Command

sh /data/seatunnel/seatunnel-2.3.4/bin/start-seatunnel-spark-3-connector-v2.sh \
  --master local \
  --deploy-mode client \
  --queue ghydl \
  --executor-instances 4 \
  --executor-cores 4 \
  --executor-memory 4g \
  --name "h019-ods_file_pjp_old_new_sku_yy" \
  --config /2.3.5/h019-ods_file_pjp_old_new_sku_yy.conf

Error Exception

No exception is thrown; the only symptom is that the data is written 3 times.

Zeta or Flink or Spark Version

No response

Java or Scala Version

/usr/local/jdk/jdk1.8.0_341

Screenshots

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@AdkinsHan AdkinsHan added the bug label May 17, 2024