The fundamental reason that big data mining systems were rare and expensive is that scaling a system to process large data sets is very difficult; as we will see, processing has traditionally been limited to the power that can be built into a single computer. There are, however, two broad approaches to scaling a system as the size of the data increases, generally referred to as scale-up and scale-out.
Scale-up
In most enterprises, data processing has typically been performed on impressively large computers with even more impressive price tags. As the size of the data grows, the approach is to move to a bigger server or storage array, and the cost of such hardware can easily run to hundreds of thousands or even millions of dollars.
The advantage of simple scale-up is that the architecture does not significantly change as the system grows. Though larger components are used, the basic relationship (for example, database server and storage array) stays the same. For applications such as commercial database engines, the software handles the complexities of utilizing the available hardware, and in theory increased scale is achieved by migrating the same software onto larger and larger servers. Note, though, that moving software onto more and more processors is never trivial; in addition, there are practical limits on just how big a single host can be, so at some point scale-up cannot be extended any further. The promise of a single architecture at any scale is also unrealistic. Designing a scale-up system to handle data sets of 1 terabyte, 100 terabytes, and 1 petabyte may conceptually mean applying larger versions of the same components, but the complexity of their connectivity may range from cheap commodity parts to custom hardware as the scale increases.
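To make the "same architecture, bigger boxes" idea concrete, here is a toy Python sketch; the tier names and figures are illustrative assumptions rather than real sizing numbers. Scaling up simply multiplies every capacity while the topology stays identical:

# A toy sketch of scale-up: the architecture (one database server in front of
# one storage array) is unchanged; only the component sizes grow.
def scale_up(deployment, factor):
    # Same tiers, same relationships; every capacity is multiplied by `factor`.
    return {tier: {resource: size * factor for resource, size in specs.items()}
            for tier, specs in deployment.items()}

small = {
    "database_server": {"cores": 8, "ram_gb": 64},
    "storage_array": {"capacity_tb": 10, "disks": 24},
}

# Ten times the data: still just two (much bigger, much pricier) boxes.
print(scale_up(small, 10))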
Scale-out
Instead of growing a system onto larger and larger hardware, the scale-out approach spreads the processing onto more and more machines. If the data set doubles, simply use two servers instead of a single double-sized one; if it doubles again, move to four hosts. The obvious benefit of this approach is that purchase costs remain much lower than for scale-up. Server hardware costs tend to increase sharply as one seeks to purchase larger machines: a single host may cost $5,000, while one with ten times the processing power may cost a hundred times as much. The downside is that we need to develop strategies for splitting our data processing across a fleet of servers, and the tools historically used for this purpose have proven to be complex. As a consequence, deploying a scale-out solution has required significant engineering effort; the system developer often needs to handcraft the mechanisms for data partitioning and reassembly, not to mention the logic to schedule the work across the cluster and handle individual machine failures.
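To give a sense of what that handcrafting involves, here is a minimal Python sketch (the host names are hypothetical) of the partition / process / reassemble pattern a developer would have had to build; a real scale-out system would also need scheduling and failure-recovery logic around every step.

import hashlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical worker hosts; the names are placeholders, not real machines.
WORKERS = ["host-1", "host-2", "host-3", "host-4"]

def partition(records, num_workers):
    # Handcrafted partitioning: hash each key to decide which worker gets it.
    shards = [[] for _ in range(num_workers)]
    for key, value in records:
        index = int(hashlib.md5(key.encode()).hexdigest(), 16) % num_workers
        shards[index].append((key, value))
    return shards

def process_shard(worker, shard):
    # Stand-in for shipping a shard to `worker` and running the job there;
    # a real system would also need retry logic for when a worker fails.
    return sum(value for _, value in shard)

def run_job(records):
    shards = partition(records, len(WORKERS))
    with ThreadPoolExecutor(max_workers=len(WORKERS)) as pool:
        partial_results = pool.map(process_shard, WORKERS, shards)
    # Reassembly: combine the per-worker partial results into one answer.
    return sum(partial_results)

print(run_job([("user-%d" % i, i % 7) for i in range(1000)]))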
Note: In a traditional file system, the large data set itself has to be moved across the network to the host doing the processing, which consumes a great deal of network bandwidth. In the Hadoop Distributed File System, the data is not moved; only the task is sent to the nodes holding the data, and because the task is far smaller than the data itself, the impact on network bandwidth is minimal.
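As a rough illustration of the note above, the following back-of-envelope Python sketch compares the two approaches; the data size, task size, and link speed are assumed figures, not measurements.

# Back-of-envelope comparison; the data size, task size, and link speed
# below are assumed figures chosen only to illustrate the ratio.
DATA_SIZE_BYTES = 1 * 1024**4       # a 1 TB data set spread across the cluster
TASK_SIZE_BYTES = 50 * 1024**2      # roughly 50 MB of job code and configuration
LINK_BYTES_PER_SEC = 125 * 1024**2  # approximately a 1 Gbit/s network link

def transfer_seconds(num_bytes):
    return num_bytes / LINK_BYTES_PER_SEC

# Traditional approach: pull the entire data set to the processing host.
print("moving the data: %.1f hours" % (transfer_seconds(DATA_SIZE_BYTES) / 3600))
# Hadoop approach: push the much smaller task to the nodes holding the data.
print("moving the task: %.1f seconds" % transfer_seconds(TASK_SIZE_BYTES))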