Ask HN: Does (or why does) anybody use MapReduce anymore?

2024-01-23 22:20:36

(2nd user & developer of Spark here.) It depends on what you ask.

MapReduce the framework is proprietary to Google, and some MapReduce pipelines are still running inside Google.

MapReduce as a concept is very much in use. Hadoop was inspired by MapReduce. Spark was originally built around the primitives of MapReduce, and you can still see that in the description of its operations (exchange, collect). However, Spark and all the other modern frameworks realized that:

– users did not care about mapping and reducing; they wanted higher-level primitives (filtering, joins, …)

– MapReduce was great for one-shot batch processing of data, but struggled to accommodate other very common use cases at scale (low latency, graph processing, streaming, distributed machine learning, …). You can do these on top of MapReduce, but if you really start tuning for the specific case, you end up with something rather different. For example, Kafka (a scalable streaming engine) is inspired by the general principles of MR, but the use cases and APIs are now quite different.

There really was always only Map and Shuffle (Reduce is just Shuffle+Map; also another name for Shuffle is GroupByKey). And you see those primitives under the hood of most parallel systems.
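The "Reduce is just Shuffle+Map" claim can be sketched concretely. Here is a toy word count in plain Python (illustrative only, not how any real engine implements it): a map phase emits key/value pairs, a shuffle groups them by key, and "reduce" is then just another map over the groups.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit (key, value) pairs -- here, (word, 1)
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle (a.k.a. GroupByKey): collect all values sharing a key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # "Reduce" is just Shuffle + one more Map over the grouped values
    return {key: sum(values) for key, values in groups.items()}

lines = ["the quick fox", "the lazy dog"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

In a distributed system, the map and the final per-group map run in parallel; the shuffle is the expensive part that moves data across the network.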

At a high level, most distributed data systems look something like MapReduce, and that’s really just fancy divide-and-conquer. It’s hard to reason about, and most data at this size is tabular, so you’re usually better off using something where you can write a SQL query and let the query engine do the low-level map-reduce work.
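To illustrate the point about letting a query engine do the low-level work: the same aggregation that takes three hand-written MR phases is one line of SQL. A minimal sketch using Python's built-in sqlite3 (table and column names are made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, amount INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("alice", 10), ("bob", 5), ("alice", 7)])

# GROUP BY is, under the hood, map (project key) + shuffle (group) + map (aggregate);
# the query engine plans and executes those steps for you.
rows = conn.execute(
    "SELECT user, SUM(amount) FROM events GROUP BY user ORDER BY user"
).fetchall()
print(rows)  # [('alice', 17), ('bob', 5)]
```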

The concept is quite alive, and the fancy deep learning frameworks have it: jax.lax.map, jax.lax.reduce.

It’s going to stay because it is useful:

Any operation that you can express as an associative combination is automatically parallelizable. In both Spark and Torch/JAX this means scaling to a cluster, with the code going to the data. This is the unfair advantage when solving bigger problems.
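Why associativity buys parallelism, in a toy single-machine sketch (the chunking here stands in for distributing work across machines): because `(a + b) + c == a + (b + c)`, each chunk can be reduced independently and the partial results combined in any grouping, and the answer is the same as a single sequential pass.

```python
from functools import reduce

data = list(range(1, 101))

def chunked(xs, n):
    # The "divide" step: split into n roughly equal chunks.
    # In a cluster, each chunk would live on a different machine.
    k = (len(xs) + n - 1) // n
    return [xs[i:i + k] for i in range(0, len(xs), k)]

# Because + is associative, each chunk reduces independently...
partials = [reduce(lambda a, b: a + b, chunk) for chunk in chunked(data, 4)]
# ...and the partials combine into the same result as one sequential pass.
total = reduce(lambda a, b: a + b, partials)
print(total)  # 5050
```

The same pattern works for any associative operator (max, set union, count sketches, …); non-associative operations (e.g. subtraction) cannot be split this way.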

If you were talking about the Hadoop ecosystem, then yes Spark pretty much nailed it and is dominant (no need to have another implementation)

That’s my understanding. MR is very simplistic, and it's awkward or impossible to express many problems in it, whereas dataflow processors like Spark and Apache Beam support creating complex DAGs with a rich set of operators for grouping, windowing, joining, etc. that you just don’t have in MR. You can do MR within a DAG, so you could say that dataflows are a generalization or superset of the MR model.

> You can do MR within a DAG, so you could say that dataflows are a generalization or superset of the MR model.

I think it’s the opposite of this. MapReduce is a very generic mechanism for splitting computation up so that it can be distributed. It would be possible to build Spark/Beam and all their higher level DAG components out of MapReduce operations.
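As a concrete illustration of building a higher-level operator out of MR primitives, here is a toy inner join expressed as map + group-by-key + map, in plain Python (relation names and data are invented for the example; real engines optimize this heavily):

```python
from collections import defaultdict

users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (2, "mug")]

# Map: tag each record with its source relation, keyed by user id
tagged = ([(uid, ("user", name)) for uid, name in users]
          + [(uid, ("order", item)) for uid, item in orders])

# Shuffle (GroupByKey): co-locate all records that share a key
groups = defaultdict(list)
for key, value in tagged:
    groups[key].append(value)

# Final map: within each group, pair user names with their order items
joined = [(name, item)
          for values in groups.values()
          for tag_u, name in values if tag_u == "user"
          for tag_o, item in values if tag_o == "order"]
print(sorted(joined))  # [('alice', 'book'), ('alice', 'pen'), ('bob', 'mug')]
```

This is essentially a reduce-side join: the shuffle does all the heavy lifting of bringing matching records together.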

I don’t mean generalization that way. Dataflow operators can be expressed with MR as the underlying primitive, as you say. But MR itself, as described in the original paper at least, only has the two stages, map and reduce; it’s not a dataflow system. And it turns out people want dataflow systems, not to hand-code MR and wire up the DAG manually.

I’m not sure what you describe is the opposite?

I mean, you can implement function calls (and other control flow operators like exceptions or loops) as GOTOs and conditional branches, and that’s what your compiler does.

But that doesn’t really mean it’s useful to think of GOTOs being the generalisation.

Most of the time, it’s just the opposite: you can think of a GOTO as a very specific kind of function call, a tail call without any arguments. See e.g. https://www2.cs.sfu.ca/CourseCentral/383/havens/pubs/lambda-…

I feel like a lot of the underlying concepts of mapreduce live large in multi-threaded applications even on a single machine.

It’s definitely not a dead concept; I guess it’s just not sexy to talk about, though.

You have no idea how long the tail of legacy MR-based daily stat aggregation workflows is in BigCorps.

The batch daily log processor jobs will last longer than Fortran. Longer than Cobol. Longer than earth itself.

> The batch daily log processor jobs will last longer than Fortran. Longer than Cobol.

Nonsense… They’ll end at the same time. Which is approximately concurrently with the universe.

The correct language for querying data is, as always, SQL. No one cares about the implementation details.

“I have data and I know SQL. What is it about your database that makes retrieving it better?”

Any other paradigm is going to be niche at best, and will likely fail outright.


I know I’m replying to a troll comment, but:

> “I have data and I know SQL. What is it about your database that makes retrieving it better?”

Because my data comes from a variety of unstructured, possibly dirty sources which need cleaning and transforming before they can be made sense of.

> Because my data comes from a variety of unstructured, possibly dirty sources which need cleaning and transforming before they can be made sense of.

Seattle Data Guy had a great end-of-year top-10 memes post recently, and one of them went like this:

> oh cool you’ve hired a data scientist. so you have a collection of reliable and easy to query data sources, right?

> …

> you do have a collection of reliable and easy to query data sources, right?

—

Like, most of the time in businesses… if the data can’t be queried with SQL then it’s not ready to be used by the rest of the business. Whether that’s for dashboards, monitoring, downstream analytics or reporting. Data engineers do the dirty data cleaning. Data scientists do the actual science.

That’s what I took from the parent at least.

YMMV obviously depending on your domain, ML being a good example, where things like end-to-end speech-to-text operate on WAV files directly.

That’s true. With dbt (= SQL + Jinja templating in an opinionated framework), a large SQL codebase actually becomes maintainable. If in any way possible, I’ll usually load my raw data into an OLAP store (Snowflake, BigQuery) and do all the transforms there. At least for JSON data that works really well. Combine it with dbt tests and you’re safe.

See https://www.getdbt.com/
