31 Matching Annotations
  1. Last 7 days
    1. Scale-out analytics frameworks such as Spark, Databricks, and Snowflake rely on clusters composed of many servers to handle memory-intensive ETL workloads, which leads to high infrastructure cost and inefficiencies from data movement and memory pressure.

      Targeting Spark/Databricks/Snowflake ETL is a strategic move beyond pure LLM inference: these are massive, established workloads with well-understood cost structures. If MX1 can consolidate multi-server ETL jobs, the ROI argument to CFOs becomes straightforward — fewer servers, same throughput, predictable savings.

  2. Oct 2025
  3. Dec 2024
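    1. coalesce reduces the parallelism of the computation within the same stage, which leads to low CPU utilization and longer task execution time. We currently have an implementation that must write the final result as a single Avro file. The preceding transformations can be of any kind, and in the final stage we append repartition(1).write().format('avro').mode('overwrite').save('path'). We recently found that when the preceding transformations include a sort, repartition(1) sometimes writes the single file in the wrong order, whereas coalesce(1) keeps the order correct, but coalesce(1) has performance problems. One idea so far is to collect to d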

      https://stackoverflow.com/questions/31610971/spark-repartition-vs-coalesce
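
      The ordering difference described above can be sketched with a toy model in plain Python. This is not Spark code: the function names and the "blocks arrive in arbitrary order" behavior are simplifying assumptions used only to illustrate why a shuffle-based repartition(1) may not preserve a prior sort, while a shuffle-free coalesce(1) reads the parent partitions in index order.

```python
import random
from typing import List


def coalesce_to_one(partitions: List[List[int]]) -> List[int]:
    """Toy model of coalesce(1): no shuffle, parent partitions are
    concatenated in partition-index order."""
    return [row for part in partitions for row in part]


def repartition_to_one(partitions: List[List[int]], seed: int) -> List[int]:
    """Toy model of repartition(1): a shuffle in which the single
    downstream task receives upstream blocks in arbitrary order
    (assumption: modelled here with a seeded random permutation)."""
    blocks = list(partitions)
    random.Random(seed).shuffle(blocks)
    return [row for block in blocks for row in block]


# Three partitions that together hold a globally sorted dataset.
sorted_partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

coalesced = coalesce_to_one(sorted_partitions)
repartitioned = repartition_to_one(sorted_partitions, seed=7)

print("coalesce(1) model:   ", coalesced)
print("repartition(1) model:", repartitioned)
```

      In this model both outputs contain the same rows, but only the coalesce variant is guaranteed to keep the global order. The price is the one the highlight names: without a shuffle boundary, the upstream work in that stage runs with a parallelism of one.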

  4. Dec 2022
  5. Sep 2022
  6. Jun 2022
  7. May 2022
    1. Spark properties can be divided mainly into two kinds. One kind is deploy-related, such as “spark.driver.memory” and “spark.executor.instances”; these properties may not take effect when set programmatically through SparkConf at runtime, or their behavior depends on the cluster manager and deploy mode you choose, so it is advisable to set them through a configuration file or spark-submit command-line options. The other kind mainly relates to Spark runtime control, such as “spark.task.maxFailures”; these properties can be set either way.

      spark properties
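
      A minimal sketch of that split, in plain Python with no Spark dependency. The DEPLOY_PROPERTIES set contains only the two deploy-related examples named in the highlight and is not an exhaustive list; the helper names and the etl_job.py application file are hypothetical. The only spark-submit feature relied on is the --conf key=value option.

```python
from typing import Dict, List, Tuple

# Assumption: illustrative subset taken from the quoted passage,
# not a complete list of deploy-related Spark properties.
DEPLOY_PROPERTIES = {"spark.driver.memory", "spark.executor.instances"}


def split_properties(props: Dict[str, str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Separate deploy-related properties from runtime-control ones."""
    deploy = {k: v for k, v in props.items() if k in DEPLOY_PROPERTIES}
    runtime = {k: v for k, v in props.items() if k not in DEPLOY_PROPERTIES}
    return deploy, runtime


def spark_submit_args(deploy: Dict[str, str], app: str) -> List[str]:
    """Build a spark-submit argument list that passes the deploy-related
    properties on the command line via --conf key=value."""
    args = ["spark-submit"]
    for key in sorted(deploy):
        args += ["--conf", f"{key}={deploy[key]}"]
    args.append(app)
    return args


props = {
    "spark.driver.memory": "4g",
    "spark.executor.instances": "8",
    "spark.task.maxFailures": "8",
}

deploy, runtime = split_properties(props)
command = spark_submit_args(deploy, "etl_job.py")  # hypothetical app file

print(" ".join(command))
print("left for SparkConf at runtime:", runtime)
```

      The properties left in runtime are the kind the passage says can be set either way, for example on a SparkConf object inside the application.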

    1. keep only what resonates in a trusted place that you control, and to leave the rest aside

      Though it may lead down the road to the collector's fallacy, one should note down, annotate, or highlight the things that resonate with them. Similar to Marie Kondo's concept in home organization, one ought to find ideas that "spark joy" or move them internally. These have a reasonable ability to be reused or turned into something with a bit of coaxing and work. Collect now to be able to filter later.

  8. Apr 2022
    1. E-tivities generally involve the tutor providing a small piece of information, stimulus or challenge, which Salmon refers to as the 'spark'.

      Indeed, these e-tivities are exactly that: an important stimulus in this new mode of teaching. Students need to feel part of the "classroom" and to feel motivated to learn.

  9. Jan 2022
  10. May 2021
  11. Apr 2021
  12. Feb 2021
  13. Jun 2020
  14. Sep 2019
  15. Mar 2019
  16. Dec 2018
  17. Nov 2018
  18. Oct 2018
  19. Jan 2018
  20. May 2017
  21. Apr 2017
  22. Apr 2014
    1. Mike Olson of Cloudera is on record as predicting that Spark will be the replacement for Hadoop MapReduce. Just about everybody seems to agree, except perhaps for Hortonworks folks betting on the more limited and less mature Tez. Spark’s biggest technical advantages as a general data processing engine are probably: The Directed Acyclic Graph processing model. (Any serious MapReduce-replacement contender will probably echo that aspect.) A rich set of programming primitives in connection with that model. Support also for highly-iterative processing, of the kind found in machine learning. Flexible in-memory data structures, namely the RDDs (Resilient Distributed Datasets). A clever approach to fault-tolerance.

      Spark's advantages:

      • DAG processing model
      • programming primitives for DAG model
      • highly-iterative processing suited for ML
      • RDD in-memory data structures
      • clever approach to fault-tolerance