+ This is an optimized version of {@link #getStrings(String)}
+
+ @param name property name.
+ @return property value as a collection of Strings.]]>
+
+
+
+
+
+ name property as
+ an array of Strings.
+ If no such property is specified then null is returned.
+
+ @param name property name.
+ @return property value as an array of Strings,
+ or null.]]>
+
+
+
+
+
+
+ name property as
+ an array of Strings.
+ If no such property is specified then default value is returned.
+
+ @param name property name.
+ @param defaultValue The default value
+ @return property value as an array of Strings,
+ or default value.]]>
+
+
+
+
+
+
+ name property as
+ as comma delimited values.
+
+ @param name property name.
+ @param values The values]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ name property
+ as an array of Class.
+ The value of the property specifies a list of comma separated class names.
+ If no such property is specified, then defaultValue is
+ returned.
+
+ @param name the property name.
+ @param defaultValue default value.
+ @return property value as a Class[],
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property as a Class.
+ If no such property is specified, then defaultValue is
+ returned.
+
+ @param name the class name.
+ @param defaultValue default value.
+ @return property value as a Class,
+ or defaultValue.]]>
+
+
+
+
+
+
+
+ name property as a Class
+ implementing the interface specified by xface.
+
+ If no such property is specified, then defaultValue is
+ returned.
+
+ An exception is thrown if the returned class does not implement the named
+ interface.
+
+ @param name the class name.
+ @param defaultValue default value.
+ @param xface the interface implemented by the named class.
+ @return property value as a Class,
+ or defaultValue.]]>
+
+
+
+
+
+
+
+ name property to the name of a
+ theClass implementing the given interface xface.
+
+ An exception is thrown if theClass does not implement the
+ interface xface.
+
+ @param name property name.
+ @param theClass property value.
+ @param xface the interface implemented by the named class.]]>
+
+
+
+
+
+
+
+ dirsProp with
+ the given path. If dirsProp contains multiple directories,
+ then one is chosen based on path's hash code. If the selected
+ directory does not exist, an attempt is made to create it.
+
+ @param dirsProp directory in which to locate the file.
+ @param path file-path.
+ @return local file under the directory with the given path.]]>
+
+
+
+
+
+
+
+ dirsProp with
+ the given path. If dirsProp contains multiple directories,
+ then one is chosen based on path's hash code. If the selected
+ directory does not exist, an attempt is made to create it.
+
+ @param dirsProp directory in which to locate the file.
+ @param path file-path.
+ @return local file under the directory with the given path.]]>
+
+
+
+
+
+
+
+
+
+
+
+ name.
+
+ @param name configuration resource name.
+ @return an input stream attached to the resource.]]>
+
+
+
+
+
+ name.
+
+ @param name configuration resource name.
+ @return a reader attached to the resource.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ String
+ key-value pairs in the configuration.
+
+ @return an iterator over the entries.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true to set quiet-mode on, false
+ to turn it off.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Resources
+
+
Configurations are specified by resources. A resource contains a set of
+ name/value pairs as XML data. Each resource is named by either a
+ String or by a {@link Path}. If named by a String,
+ then the classpath is examined for a file with that name. If named by a
+ Path, then the local filesystem is examined directly, without
+ referring to the classpath.
+
+
Unless explicitly turned off, Hadoop by default specifies two
+ resources, loaded in-order from the classpath:
core-site.xml: Site-specific configuration for a given hadoop
+ installation.
+
+ Applications may add additional resources, which are loaded
+ subsequent to these resources in the order they are added.
+
+
Final Parameters
+
+
Configuration parameters may be declared final.
+ Once a resource declares a value final, no subsequently-loaded
+ resource can alter that value.
+ For example, one might define a final parameter with:
+
+
+ When conf.get("tempdir") is called, then ${basedir}
+ will be resolved to another property in this Configuration, while
+ ${user.name} would then ordinarily be resolved to the value
+ of the System property with that name.]]>
+
Applications specify the files, via urls (hdfs:// or http://) to be cached
+ via the {@link org.apache.hadoop.mapred.JobConf}.
+ The DistributedCache assumes that the
+ files specified via hdfs:// urls are already present on the
+ {@link FileSystem} at the path specified by the url.
+
+
The framework will copy the necessary files on to the slave node before
+ any tasks for the job are executed on that node. Its efficiency stems from
+ the fact that the files are only copied once per job and the ability to
+ cache archives which are un-archived on the slaves.
+
+
DistributedCache can be used to distribute simple, read-only
+ data/text files and/or more complex types such as archives, jars etc.
+ Archives (zip, tar and tgz/tar.gz files) are un-archived at the slave nodes.
+ Jars may be optionally added to the classpath of the tasks, a rudimentary
+ software distribution mechanism. Files have execution permissions.
+ Optionally users can also direct it to symlink the distributed cache file(s)
+ into the working directory of the task.
+
+
DistributedCache tracks modification timestamps of the cache
+ files. Clearly the cache files should not be modified by the application
+ or externally while the job is executing.
+
+
Here is an illustrative example on how to use the
+ DistributedCache:
+
+ // Setting up the cache for the application
+
+ 1. Copy the requisite files to the FileSystem:
+
+ $ bin/hadoop fs -copyFromLocal lookup.dat /myapp/lookup.dat
+ $ bin/hadoop fs -copyFromLocal map.zip /myapp/map.zip
+ $ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar
+ $ bin/hadoop fs -copyFromLocal mytar.tar /myapp/mytar.tar
+ $ bin/hadoop fs -copyFromLocal mytgz.tgz /myapp/mytgz.tgz
+ $ bin/hadoop fs -copyFromLocal mytargz.tar.gz /myapp/mytargz.tar.gz
+
+ 2. Setup the application's JobConf:
+
+ JobConf job = new JobConf();
+ DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"),
+ job);
+ DistributedCache.addCacheArchive(new URI("/myapp/map.zip", job);
+ DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
+ DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar", job);
+ DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz", job);
+ DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz", job);
+
+ 3. Use the cached files in the {@link org.apache.hadoop.mapred.Mapper}
+ or {@link org.apache.hadoop.mapred.Reducer}:
+
+ public static class MapClass extends MapReduceBase
+ implements Mapper<K, V, K, V> {
+
+ private Path[] localArchives;
+ private Path[] localFiles;
+
+ public void configure(JobConf job) {
+ // Get the cached archives/files
+ localArchives = DistributedCache.getLocalCacheArchives(job);
+ localFiles = DistributedCache.getLocalCacheFiles(job);
+ }
+
+ public void map(K key, V value,
+ OutputCollector<K, V> output, Reporter reporter)
+ throws IOException {
+ // Use data from the cached archives/files here
+ // ...
+ // ...
+ output.collect(k, v);
+ }
+ }
+
+
+ A filename pattern is composed of regular characters and
+ special pattern matching characters, which are:
+
+
+
+
+
+
?
+
Matches any single character.
+
+
+
*
+
Matches zero or more characters.
+
+
+
[abc]
+
Matches a single character from character set
+ {a,b,c}.
+
+
+
[a-b]
+
Matches a single character from the character range
+ {a...b}. Note that character a must be
+ lexicographically less than or equal to character b.
+
+
+
[^a]
+
Matches a single character that is not from character set or range
+ {a}. Note that the ^ character must occur
+ immediately to the right of the opening bracket.
+
+
+
\c
+
Removes (escapes) any special meaning of character c.
+
+
+
{ab,cd}
+
Matches a string from the string set {ab, cd}
+
+
+
{ab,c{de,fh}}
+
Matches a string from the string set {ab, cde, cfh}
+
+
+
+
+
+ @param pathPattern a regular expression specifying a pth pattern
+
+ @return an array of paths that match the path pattern
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ All user code that may potentially use the Hadoop Distributed
+ File System should be written to use a FileSystem object. The
+ Hadoop DFS is a multi-machine system that appears as a single
+ disk. It's useful because of its fault tolerance and potentially
+ very large capacity.
+
+
+ The local implementation is {@link LocalFileSystem} and distributed
+ implementation is DistributedFileSystem.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FilterFileSystem contains
+ some other file system, which it uses as
+ its basic file system, possibly transforming
+ the data along the way or providing additional
+ functionality. The class FilterFileSystem
+ itself simply overrides all methods of
+ FileSystem with versions that
+ pass all requests to the contained file
+ system. Subclasses of FilterFileSystem
+ may further override some of these methods
+ and may also provide additional methods
+ and fields.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ buf at offset
+ and checksum into checksum.
+ The method is used for implementing read, therefore, it should be optimized
+ for sequential reading
+ @param pos chunkPos
+ @param buf desitination buffer
+ @param offset offset in buf at which to store data
+ @param len maximun number of bytes to read
+ @return number of bytes read]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ -1 if the end of the
+ stream is reached.
+ @exception IOException if an I/O error occurs.]]>
+
+
+
+
+
+
+
+
+ This method implements the general contract of the corresponding
+ {@link InputStream#read(byte[], int, int) read} method of
+ the {@link InputStream} class. As an additional
+ convenience, it attempts to read as many bytes as possible by repeatedly
+ invoking the read method of the underlying stream. This
+ iterated read continues until one of the following
+ conditions becomes true:
+
+
The specified number of bytes have been read,
+
+
The read method of the underlying stream returns
+ -1, indicating end-of-file.
+
+
If the first read on the underlying stream returns
+ -1 to indicate end-of-file then this method returns
+ -1. Otherwise this method returns the number of bytes
+ actually read.
+
+ @param b destination buffer.
+ @param off offset at which to start storing bytes.
+ @param len maximum number of bytes to read.
+ @return the number of bytes read, or -1 if the end of
+ the stream has been reached.
+ @exception IOException if an I/O error occurs.
+ ChecksumException if any checksum error occurs]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ n bytes of data from the
+ input stream.
+
+
This method may skip more bytes than are remaining in the backing
+ file. This produces no exception and the number of bytes skipped
+ may include some number of bytes that were beyond the EOF of the
+ backing file. Attempting to read from the stream after skipping past
+ the end will result in -1 indicating the end of the file.
+
+
If n is negative, no bytes are skipped.
+
+ @param n the number of bytes to be skipped.
+ @return the actual number of bytes skipped.
+ @exception IOException if an I/O error occurs.
+ ChecksumException if the chunk to skip to is corrupted]]>
+
+
+
+
+
+
+ This method may seek past the end of the file.
+ This produces no exception and an attempt to read from
+ the stream will result in -1 indicating the end of the file.
+
+ @param pos the postion to seek to.
+ @exception IOException if an I/O error occurs.
+ ChecksumException if the chunk to seek to is corrupted]]>
+
+
+
+
+
+
+
+
+
+ len bytes from
+ stm
+
+ @param stm an input stream
+ @param buf destiniation buffer
+ @param offset offset at which to store data
+ @param len number of bytes to read
+ @return actual number of bytes read
+ @throws IOException if there is any IO error]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ len bytes from the specified byte array
+ starting at offset off and generate a checksum for
+ each data chunk.
+
+
This method stores bytes from the given array into this
+ stream's buffer before it gets checksumed. The buffer gets checksumed
+ and flushed to the underlying output stream when all data
+ in a checksum chunk are in the buffer. If the buffer is empty and
+ requested length is at least as large as the size of next checksum chunk
+ size, this method will checksum and write the chunk directly
+ to the underlying output stream. Thus it avoids uneccessary data copy.
+
+ @param b the data.
+ @param off the start offset in the data.
+ @param len the number of bytes to write.
+ @exception IOException if an I/O error occurs.]]>
+
+
+This pages describes how to use Kosmos Filesystem
+( KFS ) as a backing
+store with Hadoop. This page assumes that you have downloaded the
+KFS software and installed necessary binaries as outlined in the KFS
+documentation.
+
+
Steps
+
+
+
In the Hadoop conf directory edit core-site.xml,
+ add the following:
+
In the Hadoop conf directory edit core-site.xml,
+ adding the following (with appropriate values for
+ <server> and <port>):
+
+<property>
+ <name>fs.default.name</name>
+ <value>kfs://<server:port></value>
+</property>
+
+<property>
+ <name>fs.kfs.metaServerHost</name>
+ <value><server></value>
+ <description>The location of the KFS meta server.</description>
+</property>
+
+<property>
+ <name>fs.kfs.metaServerPort</name>
+ <value><port></value>
+ <description>The location of the meta server's port.</description>
+</property>
+
+
+
+
+
Copy KFS's kfs-0.1.jar to Hadoop's lib directory. This step
+ enables Hadoop's to load the KFS specific modules. Note
+ that, kfs-0.1.jar was built when you compiled KFS source
+ code. This jar file contains code that calls KFS's client
+ library code via JNI; the native code is in KFS's
+ libkfsClient.so library.
+
+
+
When the Hadoop map/reduce trackers start up, those
+processes (on local as well as remote nodes) will now need to load
+KFS's libkfsClient.so library. To simplify this process, it is advisable to
+store libkfsClient.so in an NFS accessible directory (similar to where
+Hadoop binaries/scripts are stored); then, modify Hadoop's
+conf/hadoop-env.sh adding the following line and providing suitable
+value for <path>:
+
+export LD_LIBRARY_PATH=<path>
+
+
+
+
Start only the map/reduce trackers
+
+ example: execute Hadoop's bin/start-mapred.sh
+Files are stored in S3 as blocks (represented by
+{@link org.apache.hadoop.fs.s3.Block}), which have an ID and a length.
+Block metadata is stored in S3 as a small record (represented by
+{@link org.apache.hadoop.fs.s3.INode}) using the URL-encoded
+path string as a key. Inodes record the file type (regular file or directory) and the list of blocks.
+This design makes it easy to seek to any given position in a file by reading the inode data to compute
+which block to access, then using S3's support for
+HTTP Range headers
+to start streaming from the correct position.
+Renames are also efficient since only the inode is moved (by a DELETE followed by a PUT since
+S3 does not support renames).
+
+
+For a single file /dir1/file1 which takes two blocks of storage, the file structure in S3
+would be something like this:
+
+
+ DataInputBuffer buffer = new DataInputBuffer();
+ while (... loop condition ...) {
+ byte[] data = ... get data ...;
+ int dataLength = ... get data length ...;
+ buffer.reset(data, dataLength);
+ ... read buffer using DataInput methods ...
+ }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This saves memory over creating a new DataOutputStream and
+ ByteArrayOutputStream each time data is written.
+
+
Typical usage is something like the following:
+
+ DataOutputBuffer buffer = new DataOutputBuffer();
+ while (... loop condition ...) {
+ buffer.reset();
+ ... write buffer using DataOutput methods ...
+ byte[] data = buffer.getData();
+ int dataLength = buffer.getLength();
+ ... write data to its ultimate destination ...
+ }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to store
+ @param item the object to be stored
+ @param keyName the name of the key to use
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param keyName the name of the key to use
+ @param itemClass the class of the item
+ @return restored object
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param items the objects to be stored
+ @param keyName the name of the key to use
+ @throws IndexOutOfBoundsException if the items array is empty
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param keyName the name of the key to use
+ @param itemClass the class of the item
+ @return restored object
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+ DefaultStringifier offers convenience methods to store/load objects to/from
+ the configuration.
+
+ @param the class of the objects to stringify]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a DoubleWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a FloatWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ When two sequence files, which have same Key type but different Value
+ types, are mapped out to reduce, multiple Value types is not allowed.
+ In this case, this class can help you wrap instances with different types.
+
+
+
+ Compared with ObjectWritable, this class is much more effective,
+ because ObjectWritable will append the class declaration as a String
+ into the output file in every Key-Value pair.
+
+
+
+ Generic Writable implements {@link Configurable} interface, so that it will be
+ configured by the framework. The configuration is passed to the wrapped objects
+ implementing {@link Configurable} interface before deserialization.
+
+
+ how to use it:
+ 1. Write your own class, such as GenericObject, which extends GenericWritable.
+ 2. Implements the abstract method getTypes(), defines
+ the classes which will be wrapped in GenericObject in application.
+ Attention: this classes defined in getTypes() method, must
+ implement Writable interface.
+
+
+ @since Nov 8, 2006]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This saves memory over creating a new InputStream and
+ ByteArrayInputStream each time data is read.
+
+
Typical usage is something like the following:
+
+ InputBuffer buffer = new InputBuffer();
+ while (... loop condition ...) {
+ byte[] data = ... get data ...;
+ int dataLength = ... get data length ...;
+ buffer.reset(data, dataLength);
+ ... read buffer using InputStream methods ...
+ }
+
+ @see DataInputBuffer
+ @see DataOutput]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a IntWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ closes the input and output streams
+ at the end.
+ @param in InputStrem to read from
+ @param out OutputStream to write to
+ @param conf the Configuration object]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ignore any {@link IOException} or
+ null pointers. Must only be used for cleanup in exception handlers.
+ @param log the log to record problems to at debug level. Can be null.
+ @param closeables the objects to close]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a LongWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A map is a directory containing two files, the data file,
+ containing all keys and values in the map, and a smaller index
+ file, containing a fraction of the keys. The fraction is determined by
+ {@link Writer#getIndexInterval()}.
+
+
The index file is read entirely into memory. Thus key implementations
+ should try to keep themselves small.
+
+
Map files are created by adding entries in-order. To maintain a large
+ database, perform updates by copying the previous version of a database and
+ merging in a sorted change list, to create a new version of the database in
+ a new file. Sorting large change lists can be done with {@link
+ SequenceFile.Sorter}.]]>
+
SequenceFile provides {@link Writer}, {@link Reader} and
+ {@link Sorter} classes for writing, reading and sorting respectively.
+
+ There are three SequenceFileWriters based on the
+ {@link CompressionType} used to compress key/value pairs:
+
+
+ Writer : Uncompressed records.
+
+
+ RecordCompressWriter : Record-compressed files, only compress
+ values.
+
+
+ BlockCompressWriter : Block-compressed files, both keys &
+ values are collected in 'blocks'
+ separately and compressed. The size of
+ the 'block' is configurable.
+
+
+
The actual compression algorithm used to compress key and/or values can be
+ specified by using the appropriate {@link CompressionCodec}.
+
+
The recommended way is to use the static createWriter methods
+ provided by the SequenceFile to chose the preferred format.
+
+
The {@link Reader} acts as the bridge and can read any of the above
+ SequenceFile formats.
+
+
SequenceFile Formats
+
+
Essentially there are 3 different formats for SequenceFiles
+ depending on the CompressionType specified. All of them share a
+ common header described below.
+
+
SequenceFile Header
+
+
+ version - 3 bytes of magic header SEQ, followed by 1 byte of actual
+ version number (e.g. SEQ4 or SEQ6)
+
+
+ keyClassName -key class
+
+
+ valueClassName - value class
+
+
+ compression - A boolean which specifies if compression is turned on for
+ keys/values in this file.
+
+
+ blockCompression - A boolean which specifies if block-compression is
+ turned on for keys/values in this file.
+
+
+ compression codec - CompressionCodec class which is used for
+ compression of keys and/or values (if compression is
+ enabled).
+
+
+ metadata - {@link Metadata} for this file.
+
+
+ sync - A sync marker to denote end of the header.
+
The compressed blocks of key lengths and value lengths consist of the
+ actual lengths of individual keys/values encoded in ZeroCompressedInteger
+ format.
+
+ @see CompressionCodec]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ key, skipping its
+ value. True if another entry exists, and false at end of file.]]>
+
+
+
+
+
+
+
+ key and
+ val. Returns true if such a pair exists and false when at
+ end of file]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The position passed must be a position returned by {@link
+ SequenceFile.Writer#getLength()} when writing this file. To seek to an arbitrary
+ position, use {@link SequenceFile.Reader#sync(long)}.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ SegmentDescriptor
+ @param segments the list of SegmentDescriptors
+ @param tmpDir the directory to write temporary files into
+ @return RawKeyValueIterator
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ For best performance, applications should make sure that the {@link
+ Writable#readFields(DataInput)} implementation of their keys is
+ very efficient. In particular, it should avoid allocating memory.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This always returns a synchronized position. In other words,
+ immediately after calling {@link SequenceFile.Reader#seek(long)} with a position
+ returned by this method, {@link SequenceFile.Reader#next(Writable)} may be called. However
+ the key may be earlier in the file than key last written when this
+ method was called (e.g., with block-compression, it may be the first key
+ in the block that was being written when this method was called).]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ key. Returns
+ true if such a key exists and false when at the end of the set.]]>
+
+
+
+
+
+
+ key.
+ Returns key, or null if no match exists.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the class of the objects to stringify]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ position. Note that this
+ method avoids using the converter or doing String instatiation
+ @return the Unicode scalar value at position or -1
+ if the position is invalid or points to a
+ trailing byte]]>
+
+
+
+
+
+
+
+
+
+ what in the backing
+ buffer, starting as position start. The starting
+ position is measured in bytes and the return value is in
+ terms of byte position in the buffer. The backing buffer is
+ not converted to a string for this operation.
+ @return byte position of the first occurence of the search
+ string in the UTF-8 buffer or -1 if not found]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a Text with the same contents.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ replace is true, then
+ malformed input is replaced with the
+ substitution character, which is U+FFFD. Otherwise the
+ method throws a MalformedInputException.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ replace is true, then
+ malformed input is replaced with the
+ substitution character, which is U+FFFD. Otherwise the
+ method throws a MalformedInputException.
+ @return ByteBuffer: bytes stores at ByteBuffer.array()
+ and length is ByteBuffer.limit()]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ In
+ addition, it provides methods for string traversal without converting the
+ byte array to a string.
Also includes utilities for
+ serializing/deserialing a string, coding/decoding a string, checking if a
+ byte array contains valid UTF8 code, calculating the length of an encoded
+ string.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a UTF8 with the same contents.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Also includes utilities for efficiently reading and writing UTF-8.
+
+ @deprecated replaced by Text]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This is useful when a class may evolve, so that instances written by the
+ old version of the class may still be processed by the new version. To
+ handle this situation, {@link #readFields(DataInput)}
+ implementations should catch {@link VersionMismatchException}.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a VIntWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a VLongWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ out.
+
+ @param out DataOuput to serialize this object into.
+ @throws IOException]]>
+
+
+
+
+
+
+ in.
+
+
For efficiency, implementations should attempt to re-use storage in the
+ existing object where possible.
+
+ @param in DataInput to deseriablize this object from.
+ @throws IOException]]>
+
+
+
+ Any key or value type in the Hadoop Map-Reduce
+ framework implements this interface.
+
+
Implementations typically implement a static read(DataInput)
+ method which constructs a new instance, calls {@link #readFields(DataInput)}
+ and returns the instance.
+
+
Example:
+
+ public class MyWritable implements Writable {
+ // Some data
+ private int counter;
+ private long timestamp;
+
+ public void write(DataOutput out) throws IOException {
+ out.writeInt(counter);
+ out.writeLong(timestamp);
+ }
+
+ public void readFields(DataInput in) throws IOException {
+ counter = in.readInt();
+ timestamp = in.readLong();
+ }
+
+ public static MyWritable read(DataInput in) throws IOException {
+ MyWritable w = new MyWritable();
+ w.readFields(in);
+ return w;
+ }
+ }
+
]]>
+
+
+
+
+
+
+
+
+ WritableComparables can be compared to each other, typically
+ via Comparators. Any type which is to be used as a
+ key in the Hadoop Map-Reduce framework should implement this
+ interface.
+
+
Example:
+
+ public class MyWritableComparable implements WritableComparable {
+ // Some data
+ private int counter;
+ private long timestamp;
+
+ public void write(DataOutput out) throws IOException {
+ out.writeInt(counter);
+ out.writeLong(timestamp);
+ }
+
+ public void readFields(DataInput in) throws IOException {
+ counter = in.readInt();
+ timestamp = in.readLong();
+ }
+
+ public int compareTo(MyWritableComparable w) {
+ int thisValue = this.value;
+ int thatValue = ((IntWritable)o).value;
+ return (thisValue < thatValue ? -1 : (thisValue==thatValue ? 0 : 1));
+ }
+ }
+
One may optimize compare-intensive operations by overriding
+ {@link #compare(byte[],int,int,byte[],int,int)}. Static utility methods are
+ provided to assist in optimized implementations of this method.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Enum type
+ @param in DataInput to read from
+ @param enumType Class type of Enum
+ @return Enum represented by String read from DataInput
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ len number of bytes in input streamin
+ @param in input stream
+ @param len number of bytes to skip
+ @throws IOException when skipped less number of bytes]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ CompressionCodec for which to get the
+ Compressor
+ @return Compressor for the given
+ CompressionCodec from the pool or a new one]]>
+
+
+
+
+
+ CompressionCodec for which to get the
+ Decompressor
+ @return Decompressor for the given
+ CompressionCodec the pool or a new one]]>
+
+
+
+
+
+ Compressor to be returned to the pool]]>
+
+
+
+
+
+ Decompressor to be returned to the
+ pool]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Implementations are assumed to be buffered. This permits clients to
+ reposition the underlying input stream then call {@link #resetState()},
+ without having to also synchronize client buffers.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true indicating that more input data is required.
+
+ @param b Input data
+ @param off Start offset
+ @param len Length]]>
+
+
+
+
+ true if the input data buffer is empty and
+ #setInput() should be called in order to provide more input.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the end of the compressed
+ data output stream has been reached.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true indicating that more input data is required.
+
+ @param b Input data
+ @param off Start offset
+ @param len Length]]>
+
+
+
+
+ true if the input data buffer is empty and
+ #setInput() should be called in order to provide more input.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ true if a preset dictionary is needed for decompression.
+ @return true if a preset dictionary is needed for decompression]]>
+
+
+
+
+ true if the end of the compressed
+ data output stream has been reached.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FIXME: This array should be in a private or package private location,
+ since it could be modified by malicious code.
+ ]]>
+
+
+
+
+ This interface is public for historical purposes. You should have no need to
+ use it.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Although BZip2 headers are marked with the magic "Bz" this
+ constructor expects the next byte in the stream to be the first one after
+ the magic. Thus callers have to skip the first two bytes. Otherwise this
+ constructor will throw an exception.
+
+
+ @throws IOException
+ if the stream content is malformed or an I/O error occurs.
+ @throws NullPointerException
+ if in == null]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The decompression requires large amounts of memory. Thus you should call the
+ {@link #close() close()} method as soon as possible, to force
+ CBZip2InputStream to release the allocated memory. See
+ {@link CBZip2OutputStream CBZip2OutputStream} for information about memory
+ usage.
+
+
+
+ CBZip2InputStream reads bytes from the compressed source stream via
+ the single byte {@link java.io.InputStream#read() read()} method exclusively.
+ Thus you should consider to use a buffered source stream.
+
+
+
+ Instances of this class are not threadsafe.
+
]]>
+
+
+
+
+
+
+
+
+
+ CBZip2OutputStream with a blocksize of 900k.
+
+
+ Attention: The caller is resonsible to write the two BZip2 magic
+ bytes "BZ" to the specified stream prior to calling this
+ constructor.
+
+
+ @param out *
+ the destination stream.
+
+ @throws IOException
+ if an I/O error occurs in the specified stream.
+ @throws NullPointerException
+ if out == null.]]>
+
+
+
+
+
+ CBZip2OutputStream with specified blocksize.
+
+
+ Attention: The caller is resonsible to write the two BZip2 magic
+ bytes "BZ" to the specified stream prior to calling this
+ constructor.
+
+
+
+ @param out
+ the destination stream.
+ @param blockSize
+ the blockSize as 100k units.
+
+ @throws IOException
+ if an I/O error occurs in the specified stream.
+ @throws IllegalArgumentException
+ if (blockSize < 1) || (blockSize > 9).
+ @throws NullPointerException
+ if out == null.
+
+ @see #MIN_BLOCKSIZE
+ @see #MAX_BLOCKSIZE]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ inputLength this method returns MAX_BLOCKSIZE
+ always.
+
+ @param inputLength
+ The length of the data which will be compressed by
+ CBZip2OutputStream.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ == 1.]]>
+
+
+
+
+ == 9.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ If you are ever unlucky/improbable enough to get a stack overflow whilst
+ sorting, increase the following constant and try again. In practice I
+ have never seen the stack go above 27 elems, so the following limit seems
+ very generous.
+ ]]>
+
+
+
+
+ The compression requires large amounts of memory. Thus you should call the
+ {@link #close() close()} method as soon as possible, to force
+ CBZip2OutputStream to release the allocated memory.
+
+
+
+ You can shrink the amount of allocated memory and maybe raise the compression
+ speed by choosing a lower blocksize, which in turn may cause a lower
+ compression ratio. You can avoid unnecessary memory allocation by avoiding
+ using a blocksize which is bigger than the size of the input.
+
+
+
+ You can compute the memory usage for compressing by the following formula:
+
+
+
+ <code>400k + (9 * blocksize)</code>.
+
+
+
+ To get the memory required for decompression by {@link CBZip2InputStream
+ CBZip2InputStream} use
+
+
+
+ <code>65k + (5 * blocksize)</code>.
+
+
+
+
+
+
+
Memory usage by blocksize
+
+
+
Blocksize
Compression
+ memory usage
Decompression
+ memory usage
+
+
+
100k
+
1300k
+
565k
+
+
+
200k
+
2200k
+
1065k
+
+
+
300k
+
3100k
+
1565k
+
+
+
400k
+
4000k
+
2065k
+
+
+
500k
+
4900k
+
2565k
+
+
+
600k
+
5800k
+
3065k
+
+
+
700k
+
6700k
+
3565k
+
+
+
800k
+
7600k
+
4065k
+
+
+
900k
+
8500k
+
4565k
+
+
+
+
+ For decompression CBZip2InputStream allocates less memory if the
+ bzipped input is smaller than one block.
+
+
+
+ Instances of this class are not threadsafe.
+
+
+
+ TODO: Update to BZip2 1.0.1
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ @return the total (non-negative) number of uncompressed bytes input so far]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ @return the total (non-negative) number of uncompressed bytes input so far]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if native-zlib is loaded & initialized
+ and can be loaded for this job, else false]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Keep trying a limited number of times, waiting a fixed time between attempts,
+ and then fail by re-throwing the exception.
+ ]]>
+
+
+
+
+
+
+
+
+ Keep trying for a maximum time, waiting a fixed time between attempts,
+ and then fail by re-throwing the exception.
+ ]]>
+
+
+
+
+
+
+
+
+ Keep trying a limited number of times, waiting a growing amount of time between attempts,
+ and then fail by re-throwing the exception.
+ The time between attempts is sleepTime mutliplied by the number of tries so far.
+ ]]>
+
+
+
+
+
+
+
+
+ Keep trying a limited number of times, waiting a growing amount of time between attempts,
+ and then fail by re-throwing the exception.
+ The time between attempts is sleepTime mutliplied by a random
+ number in the range of [0, 2 to the number of retries)
+ ]]>
+
+
+
+
+
+
+
+ Set a default policy with some explicit handlers for specific exceptions.
+ ]]>
+
+
+
+
+
+
+
+ A retry policy for RemoteException
+ Set a default policy with some explicit handlers for specific exceptions.
+ ]]>
+
+
+
+
+
+ Try once, and fail by re-throwing the exception.
+ This corresponds to having no retry mechanism in place.
+ ]]>
+
+
+
+
+
+ Try once, and fail silently for void methods, or by
+ re-throwing the exception for non-void methods.
+ ]]>
+
+
+
+
+
+ Keep trying forever.
+ ]]>
+
+
+
+
+ A collection of useful implementations of {@link RetryPolicy}.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+ Determines whether the framework should retry a
+ method for the given exception, and the number
+ of retries that have been made for that operation
+ so far.
+
+ @param e The exception that caused the method to fail.
+ @param retries The number of times the method has been retried.
+ @return true if the method should be retried,
+ false if the method should not be retried
+ but shouldn't fail with an exception (only for void methods).
+ @throws Exception The re-thrown exception e indicating
+ that the method failed and should not be retried further.]]>
+
+
+
+
+ Specifies a policy for retrying method failures.
+ Implementations of this interface should be immutable.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Create a proxy for an interface of an implementation class
+ using the same retry policy for each method in the interface.
+
+ @param iface the interface that the retry will implement
+ @param implementation the instance whose methods should be retried
+ @param retryPolicy the policy for retirying method call failures
+ @return the retry proxy]]>
+
+
+
+
+
+
+
+
+ Create a proxy for an interface of an implementation class
+ using the a set of retry policies specified by method name.
+ If no retry policy is defined for a method then a default of
+ {@link RetryPolicies#TRY_ONCE_THEN_FAIL} is used.
+
+ @param iface the interface that the retry will implement
+ @param implementation the instance whose methods should be retried
+ @param methodNameToPolicyMap a map of method names to retry policies
+ @return the retry proxy]]>
+
+
+
+
+ A factory for creating retry proxies.
+ ]]>
+
+
+
+
+
+A mechanism for selectively retrying methods that throw exceptions under certain circumstances.
+
+
+
+This will retry any method called on unreliable four times - in this case the call()
+method - sleeping 10 seconds between
+each retry. There are a number of {@link org.apache.hadoop.io.retry.RetryPolicies retry policies}
+available, or you can implement a custom one by implementing {@link org.apache.hadoop.io.retry.RetryPolicy}.
+It is also possible to specify retry policies on a
+{@link org.apache.hadoop.io.retry.RetryProxy#create(Class, Object, Map) per-method basis}.
+
]]>
+
+
+
+
+
+
+
+
+
+ Prepare the deserializer for reading.]]>
+
+
+
+
+
+
+
+ Deserialize the next object from the underlying input stream.
+ If the object t is non-null then this deserializer
+ may set its internal state to the next object read from the input
+ stream. Otherwise, if the object t is null a new
+ deserialized object will be created.
+
+ @return the deserialized object]]>
+
+
+
+
+
+ Close the underlying input stream and clear up any resources.]]>
+
+
+
+
+ Provides a facility for deserializing objects of type from an
+ {@link InputStream}.
+
+
+
+ Deserializers are stateful, but must not buffer the input since
+ other producers may read from the input between calls to
+ {@link #deserialize(Object)}.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A {@link RawComparator} that uses a {@link Deserializer} to deserialize
+ the objects to be compared so that the standard {@link Comparator} can
+ be used to compare them.
+
+
+ One may optimize compare-intensive operations by using a custom
+ implementation of {@link RawComparator} that operates directly
+ on byte representations.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ An experimental {@link Serialization} for Java {@link Serializable} classes.
+
+ @see JavaSerializationComparator]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A {@link RawComparator} that uses a {@link JavaSerialization}
+ {@link Deserializer} to deserialize objects that are then compared via
+ their {@link Comparable} interfaces.
+
+ @param
+ @see JavaSerialization]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Encapsulates a {@link Serializer}/{@link Deserializer} pair.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+ Serializations are found by reading the io.serializations
+ property from conf, which is a comma-delimited list of
+ classnames.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A factory for {@link Serialization}s.
+ ]]>
+
+
+
+
+
+
+
+
+
+ Prepare the serializer for writing.]]>
+
+
+
+
+
+
+ Serialize t to the underlying output stream.]]>
+
+
+
+
+
+ Close the underlying output stream and clear up any resources.]]>
+
+
+
+
+ Provides a facility for serializing objects of type to an
+ {@link OutputStream}.
+
+
+
+ Serializers are stateful, but must not buffer the output since
+ other producers may write to the output between calls to
+ {@link #serialize(Object)}.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+This package provides a mechanism for using different serialization frameworks
+in Hadoop. The property "io.serializations" defines a list of
+{@link org.apache.hadoop.io.serializer.Serialization}s that know how to create
+{@link org.apache.hadoop.io.serializer.Serializer}s and
+{@link org.apache.hadoop.io.serializer.Deserializer}s.
+
+
+
+To add a new serialization framework write an implementation of
+{@link org.apache.hadoop.io.serializer.Serialization} and add its name to the
+"io.serializations" property.
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ param, to the IPC server running at
+ address, returning the value. Throws exceptions if there are
+ network problems or if the remote code threw an exception.
+ @deprecated Use {@link #call(Writable, InetSocketAddress, Class, UserGroupInformation)} instead]]>
+
+
+
+
+
+
+
+
+
+ param, to the IPC server running at
+ address with the ticket credentials, returning
+ the value.
+ Throws exceptions if there are network problems or if the remote code
+ threw an exception.
+ @deprecated Use {@link #call(Writable, InetSocketAddress, Class, UserGroupInformation)} instead]]>
+
+
+
+
+
+
+
+
+
+
+ param, to the IPC server running at
+ address which is servicing the protocol protocol,
+ with the ticket credentials, returning the value.
+ Throws exceptions if there are network problems or if the remote code
+ threw an exception.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Unwraps any IOException.
+
+ @param lookupTypes the desired exception class.
+ @return IOException, which is either the lookupClass exception or this.]]>
+
+
+
+
+ This unwraps any Throwable that has a constructor taking
+ a String as a parameter.
+ Otherwise it returns this.
+
+ @return Throwable]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ protocol is a Java interface. All parameters and return types must
+ be one of:
+
+
a primitive type, boolean, byte,
+ char, short, int, long,
+ float, double, or void; or
+
+
a {@link String}; or
+
+
a {@link Writable}; or
+
+
an array of the above types
+
+ All methods in the protocol should throw only IOException. No field data of
+ the protocol instance is transmitted.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ handlerCount determines
+ the number of handler threads that will be used to process calls.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ,name=RpcActivityForPort"
+
+ Many of the activity metrics are sampled and averaged on an interval
+ which can be specified in the metrics config file.
+
+ For the metrics that are sampled and averaged, one must specify
+ a metrics context that does periodic update calls. Most metrics contexts do.
+ The default Null metrics context however does NOT. So if you aren't
+ using any other metrics context then you can turn on the viewing and averaging
+ of sampled metrics by specifying the following two lines
+ in the hadoop-meterics.properties file:
+
+ Note that the metrics are collected regardless of the context used.
+ The context with the update thread is used to average the data periodically
+
+
+
+ Impl details: We use a dynamic mbean that gets the list of the metrics
+ from the metrics registry passed as an argument to the constructor]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class has a number of metrics variables that are publicly accessible;
+ these variables (objects) have methods to update their values;
+ for example:
+
{@link #rpcQueueTime}.inc(time)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ For the statistics that are sampled and averaged, one must specify
+ a metrics context that does periodic update calls. Most do.
+ The default Null metrics context however does NOT. So if you aren't
+ using any other metrics context then you can turn on the viewing and averaging
+ of sampled metrics by specifying the following two lines
+ in the hadoop-meterics.properties file:
+
+ Note that the metrics are collected regardless of the context used.
+ The context with the update thread is used to average the data periodically]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ When constructing the instance, if the factory property
+ contextName.class exists,
+ its value is taken to be the name of the class to instantiate. Otherwise,
+ the default is to create an instance of
+ org.apache.hadoop.metrics.spi.NullContext, which is a
+ dummy "no-op" context which will cause all metric data to be discarded.
+
+ @param contextName the name of the context
+ @return the named MetricsContext]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ When the instance is constructed, this method checks if the file
+ hadoop-metrics.properties exists on the class path. If it
+ exists, it must be in the format defined by java.util.Properties, and all
+ the properties in the file are set as attributes on the newly created
+ ContextFactory instance.
+
+ @return the singleton ContextFactory instance]]>
+
+
+
+ getFactory() method.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ startMonitoring() again after calling
+ this.
+ @see #close()]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ recordName.
+ Throws an exception if the metrics implementation is configured with a fixed
+ set of record names and recordName is not in that set.
+
+ @param recordName the name of the record
+ @throws MetricsException if recordName conflicts with configuration data]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A record name identifies the kind of data to be reported. For example, a
+ program reporting statistics relating to the disks on a computer might use
+ a record name "diskStats".
+
+ A record has zero or more tags. A tag has a name and a value. To
+ continue the example, the "diskStats" record might use a tag named
+ "diskName" to identify a particular disk. Sometimes it is useful to have
+ more than one tag, so there might also be a "diskType" with value "ide" or
+ "scsi" or whatever.
+
+ A record also has zero or more metrics. These are the named
+ values that are to be reported to the metrics system. In the "diskStats"
+ example, possible metric names would be "diskPercentFull", "diskPercentBusy",
+ "kbReadPerSecond", etc.
+
+ The general procedure for using a MetricsRecord is to fill in its tag and
+ metric values, and then call update() to pass the record to the
+ client library.
+ Metric data is not immediately sent to the metrics system
+ each time that update() is called.
+ An internal table is maintained, identified by the record name. This
+ table has columns
+ corresponding to the tag and the metric names, and rows
+ corresponding to each unique set of tag values. An update
+ either modifies an existing row in the table, or adds a new row with a set of
+ tag values that are different from all the other rows. Note that if there
+ are no tags, then there can be at most one row in the table.
+
+ Once a row is added to the table, its data will be sent to the metrics system
+ on every timer period, whether or not it has been updated since the previous
+ timer period. If this is inappropriate, for example if metrics were being
+ reported by some transient object in an application, the remove()
+ method can be used to remove the row and thus stop the data from being
+ sent.
+
+ Note that the update() method is atomic. This means that it is
+ safe for different threads to be updating the same metric. More precisely,
+ it is OK for different threads to call update() on MetricsRecord instances
+ with the same set of tag names and tag values. Different threads should
+ not use the same MetricsRecord instance at the same time.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ MetricsContext.registerUpdater().]]>
+
+
+
+
+
+The API is abstract so that it can be implemented on top of
+a variety of metrics client libraries. The choice of
+client library is a configuration option, and different
+modules within the same application can use
+different metrics implementation libraries.
+
+Sub-packages:
+
+
org.apache.hadoop.metrics.spi
+
The abstract Server Provider Interface package. Those wishing to
+ integrate the metrics API with a particular metrics client library should
+ extend this package.
+
+
org.apache.hadoop.metrics.file
+
An implementation package which writes the metric data to
+ a file, or sends it to the standard output stream.
+
+
org.apache.hadoop.metrics.ganglia
+
An implementation package which sends metric data to
+ Ganglia.
+
+
+
Introduction to the Metrics API
+
+Here is a simple example of how to use this package to report a single
+metric value:
+
The context name will typically identify either the application, or else a
+ module within an application or library.
+
+
myRecord
+
The record name generally identifies some entity for which a set of
+ metrics are to be reported. For example, you could have a record named
+ "cacheStats" for reporting a number of statistics relating to the usage of
+ some cache in your application.
+
+
myMetric
+
This identifies a particular metric. For example, you might have metrics
+ named "cache_hits" and "cache_misses".
+
+
+
+
Tags
+
+In some cases it is useful to have multiple records with the same name. For
+example, suppose that you want to report statistics about each disk on a computer.
+In this case, the record name would be something like "diskStats", but you also
+need to identify the disk which is done by adding a tag to the record.
+The code could look something like this:
+
+
+Data is not sent immediately to the metrics system when
+MetricsRecord.update() is called. Instead it is stored in an
+internal table, and the contents of the table are sent periodically.
+This can be important for two reasons:
+
+
It means that a programmer is free to put calls to this API in an
+ inner loop, since updates can be very frequent without slowing down
+ the application significantly.
+
Some implementations can gain efficiency by combining many metrics
+ into a single UDP message.
+
+
+The API provides a timer-based callback via the
+registerUpdater() method. The benefit of this
+versus using java.util.Timer is that the callbacks will be done
+immediately before sending the data, making the data as current as possible.
+
+
Configuration
+
+It is possible to programmatically examine and modify configuration data
+before creating a context, like this:
+
+The factory attributes can be examined and modified using the following
+ContextFactorymethods:
+
+
Object getAttribute(String attributeName)
+
String[] getAttributeNames()
+
void setAttribute(String name, Object value)
+
void removeAttribute(attributeName)
+
+
+
+ContextFactory.getFactory() initializes the factory attributes by
+reading the properties file hadoop-metrics.properties if it exists
+on the class path.
+
+
+A factory attribute named:
+
+contextName.class
+
+should have as its value the fully qualified name of the class to be
+instantiated by a call of the CodeFactory method
+getContext(contextName). If this factory attribute is not
+specified, the default is to instantiate
+org.apache.hadoop.metrics.file.FileContext.
+
+
+Other factory attributes are specific to a particular implementation of this
+API and are documented elsewhere. For example, configuration attributes for
+the file and Ganglia implementations can be found in the javadoc for
+their respective packages.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ fileName attribute,
+ if specified. Otherwise the data will be written to standard
+ output.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class is configured by setting ContextFactory attributes which in turn
+ are usually configured through a properties file. All the attributes are
+ prefixed by the contextName. For example, the properties file might contain:
+
]]>
+
+
+
+
+
+These are the implementation specific factory attributes
+(See ContextFactory.getFactory()):
+
+
+
contextName.fileName
+
The path of the file to which metrics in context contextName
+ are to be appended. If this attribute is not specified, the metrics
+ are written to standard output by default.
+
+
contextName.period
+
The period in seconds on which the metric data is written to the
+ file.
+
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Implementation of the metrics package that sends metric data to
+Ganglia.
+Programmers should not normally need to use this package directly. Instead
+they should use org.hadoop.metrics.
+
+
+These are the implementation specific factory attributes
+(See ContextFactory.getFactory()):
+
+
+
contextName.servers
+
Space and/or comma separated sequence of servers to which UDP
+ messages should be sent.
+
+
contextName.period
+
The period in seconds on which the metric data is sent to the
+ server(s).
+
+
contextName.units.recordName.metricName
+
The units for the specified metric in the specified record.
+
+
contextName.slope.recordName.metricName
+
The slope for the specified metric in the specified record.
+
+
contextName.tmax.recordName.metricName
+
The tmax for the specified metric in the specified record.
+
+
contextName.dmax.recordName.metricName
+
The dmax for the specified metric in the specified record.
+
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ contextName.tableName. The returned map consists of
+ those attributes with the contextName and tableName stripped off.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ recordName.
+ Throws an exception if the metrics implementation is configured with a fixed
+ set of record names and recordName is not in that set.
+
+ @param recordName the name of the record
+ @throws MetricsException if recordName conflicts with configuration data]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class implements the internal table of metric data, and the timer
+ on which data is to be sent to the metrics system. Subclasses must
+ override the abstract emitRecord method in order to transmit
+ the data. ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ update
+ and remove().]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ hostname or hostname:port. If
+ the specs string is null, defaults to localhost:defaultPort.
+
+ @return a list of InetSocketAddress objects.]]>
+
+
+
+
+
+
+
+
+ org.apache.hadoop.metrics.file and
+org.apache.hadoop.metrics.ganglia.
+
+Plugging in an implementation involves writing a concrete subclass of
+AbstractMetricsContext. The subclass should get its
+ configuration information using the getAttribute(attributeName)
+ method.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ ,name="
+ Where the and are the supplied parameters
+
+ @param serviceName
+ @param nameName
+ @param theMbean - the MBean to register
+ @return the named used to register the MBean]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ hadoop.rpc.socket.factory.class.<ClassName>. When no
+ such parameter exists then fall back on the default socket factory as
+ configured by hadoop.rpc.socket.factory.class.default. If
+ this default socket factory is not configured, then fall back on the JVM
+ default socket factory.
+
+ @param conf the configuration
+ @param clazz the class (usually a {@link VersionedProtocol})
+ @return a socket factory]]>
+
+
+
+
+
+ hadoop.rpc.socket.factory.default
+
+ @param conf the configuration
+ @return the default socket factory as specified in the configuration or
+ the JVM default socket factory if the configuration does not
+ contain a default socket factory property.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ :
+ ://:/]]>
+
+
+
+
+
+
+
+ :
+ ://:/]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ From documentation for {@link #getInputStream(Socket, long)}:
+ Returns InputStream for the socket. If the socket has an associated
+ SocketChannel then it returns a
+ {@link SocketInputStream} with the given timeout. If the socket does not
+ have a channel, {@link Socket#getInputStream()} is returned. In the later
+ case, the timeout argument is ignored and the timeout set with
+ {@link Socket#setSoTimeout(int)} applies for reads.
+
+ Any socket created using socket factories returned by {@link #NetUtils},
+ must use this interface instead of {@link Socket#getInputStream()}.
+
+ @see #getInputStream(Socket, long)
+
+ @param socket
+ @return InputStream for reading from the socket.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+ Any socket created using socket factories returned by {@link #NetUtils},
+ must use this interface instead of {@link Socket#getInputStream()}.
+
+ @see Socket#getChannel()
+
+ @param socket
+ @param timeout timeout in milliseconds. This may not always apply. zero
+ for waiting as long as necessary.
+ @return InputStream for reading from the socket.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+ From documentation for {@link #getOutputStream(Socket, long)} :
+ Returns OutputStream for the socket. If the socket has an associated
+ SocketChannel then it returns a
+ {@link SocketOutputStream} with the given timeout. If the socket does not
+ have a channel, {@link Socket#getOutputStream()} is returned. In the later
+ case, the timeout argument is ignored and the write will wait until
+ data is available.
+
+ Any socket created using socket factories returned by {@link #NetUtils},
+ must use this interface instead of {@link Socket#getOutputStream()}.
+
+ @see #getOutputStream(Socket, long)
+
+ @param socket
+ @return OutputStream for writing to the socket.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+ Any socket created using socket factories returned by {@link #NetUtils},
+ must use this interface instead of {@link Socket#getOutputStream()}.
+
+ @see Socket#getChannel()
+
+ @param socket
+ @param timeout timeout in milliseconds. This may not always apply. zero
+ for waiting as long as necessary.
+ @return OutputStream for writing to the socket.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+ socket.connect(endpoint, timeout). If
+ socket.getChannel() returns a non-null channel,
+ connect is implemented using Hadoop's selectors. This is done mainly
+ to avoid Sun's connect implementation from creating thread-local
+ selectors, since Hadoop does not have control on when these are closed
+ and could end up taking all the available file descriptors.
+
+ @see java.net.Socket#connect(java.net.SocketAddress, int)
+
+ @param socket
+ @param endpoint
+ @param timeout - timeout in milliseconds]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ node
+
+ @param node
+ a node
+ @return true if node is already in the tree; false otherwise]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ scope
+ if scope starts with ~, choose one from the all nodes except for the
+ ones in scope; otherwise, choose one from scope
+ @param scope range of nodes from which a node will be choosen
+ @return the choosen node]]>
+
+
+
+
+
+
+ scope but not in excludedNodes
+ if scope starts with ~, return the number of nodes that are not
+ in scope and excludedNodes;
+ @param scope a path string that may start with ~
+ @param excludedNodes a list of nodes
+ @return number of available nodes]]>
+
+
+
+
+
+
+
+
+
+
+
+ reader
+ It linearly scans the array, if a local node is found, swap it with
+ the first element of the array.
+ If a local rack node is found, swap it with the first element following
+ the local node.
+ If neither local node or local rack node is found, put a random replica
+ location at postion 0.
+ It leaves the rest nodes untouched.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Create a new input stream with the given timeout. If the timeout
+ is zero, it will be treated as infinite timeout. The socket's
+ channel will be configured to be non-blocking.
+
+ @see SocketInputStream#SocketInputStream(ReadableByteChannel, long)
+
+ @param socket should have a channel associated with it.
+ @param timeout timeout timeout in milliseconds. must not be negative.
+ @throws IOException]]>
+
+
+
+
+
+
+
+ Create a new input stream with the given timeout. If the timeout
+ is zero, it will be treated as infinite timeout. The socket's
+ channel will be configured to be non-blocking.
+ @see SocketInputStream#SocketInputStream(ReadableByteChannel, long)
+
+ @param socket should have a channel associated with it.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Create a new ouput stream with the given timeout. If the timeout
+ is zero, it will be treated as infinite timeout. The socket's
+ channel will be configured to be non-blocking.
+
+ @see SocketOutputStream#SocketOutputStream(WritableByteChannel, long)
+
+ @param socket should have a channel associated with it.
+ @param timeout timeout timeout in milliseconds. must not be negative.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ = getCount().
+ @param newCapacity The new capacity in bytes.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Index idx = startVector(...);
+ while (!idx.done()) {
+ .... // read element of a vector
+ idx.incr();
+ }
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Introduction
+
+ Software systems of any significant complexity require mechanisms for data
+interchange with the outside world. These interchanges typically involve the
+marshaling and unmarshaling of logical units of data to and from data streams
+(files, network connections, memory buffers etc.). Applications usually have
+some code for serializing and deserializing the data types that they manipulate
+embedded in them. The work of serialization has several features that make
+automatic code generation for it worthwhile. Given a particular output encoding
+(binary, XML, etc.), serialization of primitive types and simple compositions
+of primitives (structs, vectors etc.) is a very mechanical task. Manually
+written serialization code can be susceptible to bugs especially when records
+have a large number of fields or a record definition changes between software
+versions. Lastly, it can be very useful for applications written in different
+programming languages to be able to share and interchange data. This can be
+made a lot easier by describing the data records manipulated by these
+applications in a language agnostic manner and using the descriptions to derive
+implementations of serialization in multiple target languages.
+
+This document describes Hadoop Record I/O, a mechanism that is aimed
+at
+
+
enabling the specification of simple serializable data types (records)
+
enabling the generation of code in multiple target languages for
+marshaling and unmarshaling such types
+
providing target language specific support that will enable application
+programmers to incorporate generated code into their applications
+
+
+The goals of Hadoop Record I/O are similar to those of mechanisms such as XDR,
+ASN.1, PADS and ICE. While these systems all include a DDL that enables
+the specification of most record types, they differ widely in what else they
+focus on. The focus in Hadoop Record I/O is on data marshaling and
+multi-lingual support. We take a translator-based approach to serialization.
+Hadoop users have to describe their data in a simple data description
+language. The Hadoop DDL translator rcc generates code that users
+can invoke in order to read/write their data from/to simple stream
+abstractions. Next we list explicitly some of the goals and non-goals of
+Hadoop Record I/O.
+
+
+
Goals
+
+
+
Support for commonly used primitive types. Hadoop should include as
+primitives commonly used builtin types from programming languages we intend to
+support.
+
+
Support for common data compositions (including recursive compositions).
+Hadoop should support widely used composite types such as structs and
+vectors.
+
+
Code generation in multiple target languages. Hadoop should be capable of
+generating serialization code in multiple target languages and should be
+easily extensible to new target languages. The initial target languages are
+C++ and Java.
+
+
Support for generated target languages. Hadooop should include support
+in the form of headers, libraries, packages for supported target languages
+that enable easy inclusion and use of generated code in applications.
+
+
Support for multiple output encodings. Candidates include
+packed binary, comma-separated text, XML etc.
+
+
Support for specifying record types in a backwards/forwards compatible
+manner. This will probably be in the form of support for optional fields in
+records. This version of the document does not include a description of the
+planned mechanism, we intend to include it in the next iteration.
+
+
+
+
Non-Goals
+
+
+
Serializing existing arbitrary C++ classes.
+
Serializing complex data structures such as trees, linked lists etc.
+
Built-in indexing schemes, compression, or check-sums.
+
Dynamic construction of objects from an XML schema.
+
+
+The remainder of this document describes the features of Hadoop record I/O
+in more detail. Section 2 describes the data types supported by the system.
+Section 3 lays out the DDL syntax with some examples of simple records.
+Section 4 describes the process of code generation with rcc. Section 5
+describes target language mappings and support for Hadoop types. We include a
+fairly complete description of C++ mappings with intent to include Java and
+others in upcoming iterations of this document. The last section talks about
+supported output encodings.
+
+
+
Data Types and Streams
+
+This section describes the primitive and composite types supported by Hadoop.
+We aim to support a set of types that can be used to simply and efficiently
+express a wide range of record types in different programming languages.
+
+
Primitive Types
+
+For the most part, the primitive types of Hadoop map directly to primitive
+types in high level programming languages. Special cases are the
+ustring (a Unicode string) and buffer types, which we believe
+find wide use and which are usually implemented in library code and not
+available as language built-ins. Hadoop also supplies these via library code
+when a target language built-in is not present and there is no widely
+adopted "standard" implementation. The complete list of primitive types is:
+
+
+
byte: An 8-bit unsigned integer.
+
boolean: A boolean value.
+
int: A 32-bit signed integer.
+
long: A 64-bit signed integer.
+
float: A single precision floating point number as described by
+ IEEE-754.
+
double: A double precision floating point number as described by
+ IEEE-754.
+
ustring: A string consisting of Unicode characters.
+
buffer: An arbitrary sequence of bytes.
+
+
+
+
Composite Types
+Hadoop supports a small set of composite types that enable the description
+of simple aggregate types and containers. A composite type is serialized
+by sequentially serializing it constituent elements. The supported
+composite types are:
+
+
+
+
record: An aggregate type like a C-struct. This is a list of
+typed fields that are together considered a single unit of data. A record
+is serialized by sequentially serializing its constituent fields. In addition
+to serialization a record has comparison operations (equality and less-than)
+implemented for it, these are defined as memberwise comparisons.
+
+
vector: A sequence of entries of the same data type, primitive
+or composite.
+
+
map: An associative container mapping instances of a key type to
+instances of a value type. The key and value types may themselves be primitive
+or composite types.
+
+
+
+
Streams
+
+Hadoop generates code for serializing and deserializing record types to
+abstract streams. For each target language Hadoop defines very simple input
+and output stream interfaces. Application writers can usually develop
+concrete implementations of these by putting a one method wrapper around
+an existing stream implementation.
+
+
+
DDL Syntax and Examples
+
+We now describe the syntax of the Hadoop data description language. This is
+followed by a few examples of DDL usage.
+
+
+
+A DDL file describes one or more record types. It begins with zero or
+more include declarations, a single mandatory module declaration
+followed by zero or more class declarations. The semantics of each of
+these declarations are described below:
+
+
+
+
include: An include declaration specifies a DDL file to be
+referenced when generating code for types in the current DDL file. Record types
+in the current compilation unit may refer to types in all included files.
+File inclusion is recursive. An include does not trigger code
+generation for the referenced file.
+
+
module: Every Hadoop DDL file must have a single module
+declaration that follows the list of includes and precedes all record
+declarations. A module declaration identifies a scope within which
+the names of all types in the current file are visible. Module names are
+mapped to C++ namespaces, Java packages etc. in generated code.
+
+
class: Records types are specified through class
+declarations. A class declaration is like a Java class declaration.
+It specifies a named record type and a list of fields that constitute records
+of the type. Usage is illustrated in the following examples.
+
+
+
+
Examples
+
+
+
A simple DDL file links.jr with just one record declaration.
+
+module links {
+ class Link {
+ ustring URL;
+ boolean isRelative;
+ ustring anchorText;
+ };
+}
+
+
+The Hadoop translator is written in Java. Invocation is done by executing a
+wrapper shell script named named rcc. It takes a list of
+record description files as a mandatory argument and an
+optional language argument (the default is Java) --language or
+-l. Thus a typical invocation would look like:
+
+$ rcc -l C++ ...
+
+
+
+
Target Language Mappings and Support
+
+For all target languages, the unit of code generation is a record type.
+For each record type, Hadoop generates code for serialization and
+deserialization, record comparison and access to record members.
+
+
C++
+
+Support for including Hadoop generated C++ code in applications comes in the
+form of a header file recordio.hh which needs to be included in source
+that uses Hadoop types and a library librecordio.a which applications need
+to be linked with. The header declares the Hadoop C++ namespace which defines
+appropriate types for the various primitives, the basic interfaces for
+records and streams and enumerates the supported serialization encodings.
+Declarations of these interfaces and a description of their semantics follow:
+
+
RecFormat: An enumeration of the serialization encodings supported
+by this implementation of Hadoop.
+
+
InStream: A simple abstraction for an input stream. This has a
+single public read method that reads n bytes from the stream into
+the buffer buf. Has the same semantics as a blocking read system
+call. Returns the number of bytes read or -1 if an error occurs.
+
+
OutStream: A simple abstraction for an output stream. This has a
+single write method that writes n bytes to the stream from the
+buffer buf. Has the same semantics as a blocking write system
+call. Returns the number of bytes written or -1 if an error occurs.
+
+
RecordReader: A RecordReader reads records one at a time from
+an underlying stream in a specified record format. The reader is instantiated
+with a stream and a serialization format. It has a read method that
+takes an instance of a record and deserializes the record from the stream.
+
+
RecordWriter: A RecordWriter writes records one at a
+time to an underlying stream in a specified record format. The writer is
+instantiated with a stream and a serialization format. It has a
+write method that takes an instance of a record and serializes the
+record to the stream.
+
+
Record: The base class for all generated record types. This has two
+public methods type and signature that return the typename and the
+type signature of the record.
+
+
+
+Two files are generated for each record file (note: not for each record). If a
+record file is named "name.jr", the generated files are
+"name.jr.cc" and "name.jr.hh" containing serialization
+implementations and record type declarations respectively.
+
+For each record in the DDL file, the generated header file will contain a
+class definition corresponding to the record type, method definitions for the
+generated type will be present in the '.cc' file. The generated class will
+inherit from the abstract class hadoop::Record. The DDL files
+module declaration determines the namespace the record belongs to.
+Each '.' delimited token in the module declaration results in the
+creation of a namespace. For instance, the declaration module docs.links
+results in the creation of a docs namespace and a nested
+docs::links namespace. In the preceding examples, the Link class
+is placed in the links namespace. The header file corresponding to
+the links.jr file will contain:
+
+
+namespace links {
+ class Link : public hadoop::Record {
+ // ....
+ };
+};
+
+
+Each field within the record will cause the generation of a private member
+declaration of the appropriate type in the class declaration, and one or more
+acccessor methods. The generated class will implement the serialize and
+deserialize methods defined in hadoop::Record+. It will also
+implement the inspection methods type and signature from
+hadoop::Record. A default constructor and virtual destructor will also
+be generated. Serialization code will read/write records into streams that
+implement the hadoop::InStream and the hadoop::OutStream interfaces.
+
+For each member of a record an accessor method is generated that returns
+either the member or a reference to the member. For members that are returned
+by value, a setter method is also generated. This is true for primitive
+data members of the types byte, int, long, boolean, float and
+double. For example, for a int field called MyField the folowing
+code is generated.
+
+
+
+For a ustring or buffer or composite field. The generated code
+only contains accessors that return a reference to the field. A const
+and a non-const accessor are generated. For example:
+
+
+
+Code generation for Java is similar to that for C++. A Java class is generated
+for each record type with private members corresponding to the fields. Getters
+and setters for fields are also generated. Some differences arise in the
+way comparison is expressed and in the mapping of modules to packages and
+classes to files. For equality testing, an equals method is generated
+for each record type. As per Java requirements a hashCode method is also
+generated. For comparison a compareTo method is generated for each
+record type. This has the semantics as defined by the Java Comparable
+interface, that is, the method returns a negative integer, zero, or a positive
+integer as the invoked object is less than, equal to, or greater than the
+comparison parameter.
+
+A .java file is generated per record type as opposed to per DDL
+file as in C++. The module declaration translates to a Java
+package declaration. The module name maps to an identical Java package
+name. In addition to this mapping, the DDL compiler creates the appropriate
+directory hierarchy for the package and places the generated .java
+files in the correct directories.
+
+
Mapping Summary
+
+
+DDL Type C++ Type Java Type
+
+boolean bool boolean
+byte int8_t byte
+int int32_t int
+long int64_t long
+float float float
+double double double
+ustring std::string java.lang.String
+buffer std::string org.apache.hadoop.record.Buffer
+class type class type class type
+vector std::vector java.util.ArrayList
+map std::map java.util.TreeMap
+
+
+
Data encodings
+
+This section describes the format of the data encodings supported by Hadoop.
+Currently, three data encodings are supported, namely binary, CSV and XML.
+
+
Binary Serialization Format
+
+The binary data encoding format is fairly dense. Serialization of composite
+types is simply defined as a concatenation of serializations of the constituent
+elements (lengths are included in vectors and maps).
+
+Composite types are serialized as follows:
+
+
class: Sequence of serialized members.
+
vector: The number of elements serialized as an int. Followed by a
+sequence of serialized elements.
+
map: The number of key value pairs serialized as an int. Followed
+by a sequence of serialized (key,value) pairs.
+
+
+Serialization of primitives is more interesting, with a zero compression
+optimization for integral types and normalization to UTF-8 for strings.
+Primitive types are serialized as follows:
+
+
+
byte: Represented by 1 byte, as is.
+
boolean: Represented by 1-byte (0 or 1)
+
int/long: Integers and longs are serialized zero compressed.
+Represented as 1-byte if -120 <= value < 128. Otherwise, serialized as a
+sequence of 2-5 bytes for ints, 2-9 bytes for longs. The first byte represents
+the number of trailing bytes, N, as the negative number (-120-N). For example,
+the number 1024 (0x400) is represented by the byte sequence 'x86 x04 x00'.
+This doesn't help much for 4-byte integers but does a reasonably good job with
+longs without bit twiddling.
+
float/double: Serialized in IEEE 754 single and double precision
+format in network byte order. This is the format used by Java.
+
ustring: Serialized as 4-byte zero compressed length followed by
+data encoded as UTF-8. Strings are normalized to UTF-8 regardless of native
+language representation.
+
buffer: Serialized as a 4-byte zero compressed length followed by the
+raw bytes in the buffer.
+
+
+
+
CSV Serialization Format
+
+The CSV serialization format has a lot more structure than the "standard"
+Excel CSV format, but we believe the additional structure is useful because
+
+
+
it makes parsing a lot easier without detracting too much from legibility
+
the delimiters around composites make it obvious when one is reading a
+sequence of Hadoop records
+
+
+Serialization formats for the various types are detailed in the grammar that
+follows. The notable feature of the formats is the use of delimiters for
+indicating the certain field types.
+
+
+
A string field begins with a single quote (').
+
A buffer field begins with a sharp (#).
+
A class, vector or map begins with 's{', 'v{' or 'm{' respectively and
+ends with '}'.
+
+
+The CSV format can be described by the following grammar:
+
+
+
+The XML serialization format is the same used by Apache XML-RPC
+(http://ws.apache.org/xmlrpc/types.html). This is an extension of the original
+XML-RPC format and adds some additional data types. All record I/O types are
+not directly expressible in this format, and access to a DDL is required in
+order to convert these to valid types. All types primitive or composite are
+represented by <value> elements. The particular XML-RPC type is
+indicated by a nested element in the <value> element. The encoding for
+records is always UTF-8. Primitive types are serialized as follows:
+
+
+
byte: XML tag <ex:i1>. Values: 1-byte unsigned
+integers represented in US-ASCII
+
boolean: XML tag <boolean>. Values: "0" or "1"
+
int: XML tags <i4> or <int>. Values: 4-byte
+signed integers represented in US-ASCII.
+
long: XML tag <ex:i8>. Values: 8-byte signed integers
+represented in US-ASCII.
+
float: XML tag <ex:float>. Values: Single precision
+floating point numbers represented in US-ASCII.
+
double: XML tag <double>. Values: Double precision
+floating point numbers represented in US-ASCII.
+
ustring: XML tag <;string>. Values: String values
+represented as UTF-8. XML does not permit all Unicode characters in literal
+data. In particular, NULLs and control chars are not allowed. Additionally,
+XML processors are required to replace carriage returns with line feeds and to
+replace CRLF sequences with line feeds. Programming languages that we work
+with do not impose these restrictions on string types. To work around these
+restrictions, disallowed characters and CRs are percent escaped in strings.
+The '%' character is also percent escaped.
+
buffer: XML tag <string&>. Values: Arbitrary binary
+data. Represented as hexBinary, each byte is replaced by its 2-byte
+hexadecimal representation.
+
+
+Composite types are serialized as follows:
+
+
+
class: XML tag <struct>. A struct is a sequence of
+<member> elements. Each <member> element has a <name>
+element and a <value> element. The <name> is a string that must
+match /[a-zA-Z][a-zA-Z0-9_]*/. The value of the member is represented
+by a <value> element.
+
+
vector: XML tag <array<. An <array> contains a
+single <data> element. The <data> element is a sequence of
+<value> elements each of which represents an element of the vector.
+
+
map: XML tag <array>. Same as vector.
+
+
+
+For example:
+
+
+class {
+ int MY_INT; // value 5
+ vector MY_VEC; // values 0.1, -0.89, 2.45e4
+ buffer MY_BUF; // value '\00\n\tabc%'
+}
+
The task requires the file or the nested fileset element to be
+ specified. Optional attributes are language (set the output
+ language, default is "java"),
+ destdir (name of the destination directory for generated java/c++
+ code, default is ".") and failonerror (specifies error handling
+ behavior. default is true).
+
]]>
+
+
+
+
+
+
+
+
+
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ (cause==null ? null : cause.toString()) (which
+ typically contains the class and detail message of cause).
+ @param cause the cause (which is saved for later retrieval by the
+ {@link #getCause()} method). (A null value is
+ permitted, and indicates that the cause is nonexistent or
+ unknown.)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ Group with the given groupname.
+ @param group group name]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ugi.
+ @param ugi user
+ @return the {@link Subject} for the user identified by ugi]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ugi as a comma separated string in
+ conf as a property attr
+
+ The String starts with the user name followed by the default group names,
+ and other group names.
+
+ @param conf configuration
+ @param attr property name
+ @param ugi a UnixUserGroupInformation]]>
+
+
+
+
+
+
+
+ conf
+
+ The object is expected to store with the property name attr
+ as a comma separated string that starts
+ with the user name followed by group names.
+ If the property name is not defined, return null.
+ It's assumed that there is only one UGI per user. If this user already
+ has a UGI in the ugi map, return the ugi in the map.
+ Otherwise, construct a UGI from the configuration, store it in the
+ ugi map and return it.
+
+ @param conf configuration
+ @param attr property name
+ @return a UnixUGI
+ @throws LoginException if the stored string is ill-formatted.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ User with the given username.
+ @param user user name]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ (cause==null ? null : cause.toString()) (which
+ typically contains the class and detail message of cause).
+ @param cause the cause (which is saved for later retrieval by the
+ {@link #getCause()} method). (A null value is
+ permitted, and indicates that the cause is nonexistent or
+ unknown.)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ does not provide the stack trace for security purposes.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ service as related to
+ Service Level Authorization for Hadoop.
+
+ Each service defines it's configuration key and also the necessary
+ {@link Permission} required to access the service.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ in]]>
+
+
+
+
+
+
+ out.]]>
+
+
+
+
+
+
+
+
+
+ reset is true, then resets the checksum.
+ @return number of bytes written. Will be equal to getChecksumSize();]]>
+
+
+
+
+
+
+
+
+ reset is true, then resets the checksum.
+ @return number of bytes written. Will be equal to getChecksumSize();]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ GenericOptionsParser to parse only the generic Hadoop
+ arguments.
+
+ The array of string arguments other than the generic arguments can be
+ obtained by {@link #getRemainingArgs()}.
+
+ @param conf the Configuration to modify.
+ @param args command-line arguments.]]>
+
+
+
+
+ GenericOptionsParser to parse given options as well
+ as generic Hadoop options.
+
+ The resulting CommandLine object can be obtained by
+ {@link #getCommandLine()}.
+
+ @param conf the configuration to modify
+ @param options options built by the caller
+ @param args User-specified arguments]]>
+
+
+
+
+ Strings containing the un-parsed arguments
+ or empty array if commandLine was not defined.]]>
+
+
+
+
+
+
+
+
+
+ CommandLine object
+ to process the parsed arguments.
+
+ Note: If the object is created with
+ {@link #GenericOptionsParser(Configuration, String[])}, then returned
+ object will only contain parsed generic options.
+
+ @return CommandLine representing list of arguments
+ parsed against Options descriptor.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ GenericOptionsParser is a utility to parse command line
+ arguments generic to the Hadoop framework.
+
+ GenericOptionsParser recognizes several standarad command
+ line arguments, enabling applications to easily specify a namenode, a
+ jobtracker, additional configuration resources etc.
+
+
Generic Options
+
+
The supported generic options are:
+
+ -conf <configuration file> specify a configuration file
+ -D <property=value> use value for given property
+ -fs <local|namenode:port> specify a namenode
+ -jt <local|jobtracker:port> specify a job tracker
+ -files <comma separated list of files> specify comma separated
+ files to be copied to the map reduce cluster
+ -libjars <comma separated list of jars> specify comma separated
+ jar files to include in the classpath.
+ -archives <comma separated list of archives> specify comma
+ separated archives to be unarchived on the compute machines.
+
+
Generic command line arguments might modify
+ Configuration objects, given to constructors.
+
+
The functionality is implemented using Commons CLI.
+
+
Examples:
+
+ $ bin/hadoop dfs -fs darwin:8020 -ls /data
+ list /data directory in dfs with namenode darwin:8020
+
+ $ bin/hadoop dfs -D fs.default.name=darwin:8020 -ls /data
+ list /data directory in dfs with namenode darwin:8020
+
+ $ bin/hadoop dfs -conf hadoop-site.xml -ls /data
+ list /data directory in dfs with conf specified in hadoop-site.xml
+
+ $ bin/hadoop job -D mapred.job.tracker=darwin:50020 -submit job.xml
+ submit a job to job tracker darwin:50020
+
+ $ bin/hadoop job -jt darwin:50020 -submit job.xml
+ submit a job to job tracker darwin:50020
+
+ $ bin/hadoop job -jt local -submit job.xml
+ submit a job to local runner
+
+ $ bin/hadoop jar -libjars testlib.jar
+ -archives test.tgz -files file.txt inputjar args
+ job submission with libjars, files and archives
+
+
+ @see Tool
+ @see ToolRunner]]>
+
+
+
+
+
+
+
+
+
+
+ Class<T>) of the
+ argument of type T.
+ @param The type of the argument
+ @param t the object to get it class
+ @return Class<T>]]>
+
+
+
+
+
+
+ List<T> to a an array of
+ T[].
+ @param c the Class object of the items in the list
+ @param list the list to convert]]>
+
+
+
+
+
+ List<T> to a an array of
+ T[].
+ @param list the list to convert
+ @throws ArrayIndexOutOfBoundsException if the list is empty.
+ Use {@link #toArray(Class, List)} if the list may be empty.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ io.file.buffer.size specified in the given
+ Configuration.
+ @param in input stream
+ @param conf configuration
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if native-hadoop is loaded,
+ else false]]>
+
+
+
+
+
+ true if native hadoop libraries, if present, can be
+ used for this job; false otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ { pq.top().change(); pq.adjustTop(); }
+ instead of
+ { o = pq.pop(); o.change(); pq.push(o); }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Clients and/or applications can use the provided Progressable
+ to explicitly report progress to the Hadoop framework. This is especially
+ important for operations which take an insignificant amount of time since,
+ in-lieu of the reported progress, the framework has to assume that an error
+ has occured and time-out the operation.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Class is to be obtained
+ @return the correctly typed Class of the given object.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Hadoop Pipes
+ or Hadoop Streaming.
+
+ It also checks to ensure that we are running on a *nix platform else
+ (e.g. in Cygwin/Windows) it returns null.
+ @param conf configuration
+ @return a String[] with the ulimit command arguments or
+ null if we are running on a non *nix platform or
+ if the limit is unspecified.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Shell interface.
+ @param cmd shell command to execute.
+ @return the output of the executed command.]]>
+
+
+
+
+
+
+
+ Shell interface.
+ @param env the map of environment key=value
+ @param cmd shell command to execute.
+ @return the output of the executed command.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Shell can be used to run unix commands like du or
+ df. It also offers facilities to gate commands by
+ time-intervals.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ShellCommandExecutorshould be used in cases where the output
+ of the command needs no explicit parsing and where the command, working
+ directory and the environment remains unchanged. The output of the command
+ is stored as-is and is expected to be small.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ArrayList of string values]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ charToEscape in the string
+ with the escape char escapeChar
+
+ @param str string
+ @param escapeChar escape char
+ @param charToEscape the char to be escaped
+ @return an escaped string]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ charToEscape in the string
+ with the escape char escapeChar
+
+ @param str string
+ @param escapeChar escape char
+ @param charToEscape the escaped char
+ @return an unescaped string]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Tool, is the standard for any Map-Reduce tool/application.
+ The tool/application should delegate the handling of
+
+ standard command-line options to {@link ToolRunner#run(Tool, String[])}
+ and only handle its custom arguments.
+
+
Here is how a typical Tool is implemented:
+
+ public class MyApp extends Configured implements Tool {
+
+ public int run(String[] args) throws Exception {
+ // Configuration processed by ToolRunner
+ Configuration conf = getConf();
+
+ // Create a JobConf using the processed conf
+ JobConf job = new JobConf(conf, MyApp.class);
+
+ // Process custom command-line options
+ Path in = new Path(args[1]);
+ Path out = new Path(args[2]);
+
+ // Specify various job-specific parameters
+ job.setJobName("my-app");
+ job.setInputPath(in);
+ job.setOutputPath(out);
+ job.setMapperClass(MyApp.MyMapper.class);
+ job.setReducerClass(MyApp.MyReducer.class);
+
+ // Submit the job, then poll for progress until the job is complete
+ JobClient.runJob(job);
+ }
+
+ public static void main(String[] args) throws Exception {
+ // Let ToolRunner handle generic command-line options
+ int res = ToolRunner.run(new Configuration(), new Sort(), args);
+
+ System.exit(res);
+ }
+ }
+
+
+ @see GenericOptionsParser
+ @see ToolRunner]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Tool by {@link Tool#run(String[])}, after
+ parsing with the given generic arguments. Uses the given
+ Configuration, or builds one if null.
+
+ Sets the Tool's configuration with the possibly modified
+ version of the conf.
+
+ @param conf Configuration for the Tool.
+ @param tool Tool to run.
+ @param args command-line arguments to the tool.
+ @return exit code of the {@link Tool#run(String[])} method.]]>
+
+
+
+
+
+
+
+ Tool with its Configuration.
+
+ Equivalent to run(tool.getConf(), tool, args).
+
+ @param tool Tool to run.
+ @param args command-line arguments to the tool.
+ @return exit code of the {@link Tool#run(String[])} method.]]>
+
+
+
+
+
+
+
+
+
+ ToolRunner can be used to run classes implementing
+ Tool interface. It works in conjunction with
+ {@link GenericOptionsParser} to parse the
+
+ generic hadoop command line arguments and modifies the
+ Configuration of the Tool. The
+ application-specific options are passed along without being modified.
+
+
+ @see Tool
+ @see GenericOptionsParser]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ this filter.
+ @param nbHash The number of hash function to consider.
+ @param hashType type of the hashing function (see
+ {@link org.apache.hadoop.util.hash.Hash}).]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Bloom filter, as defined by Bloom in 1970.
+
+ The Bloom filter is a data structure that was introduced in 1970 and that has been adopted by
+ the networking research community in the past decade thanks to the bandwidth efficiencies that it
+ offers for the transmission of set membership information between networked hosts. A sender encodes
+ the information into a bit vector, the Bloom filter, that is more compact than a conventional
+ representation. Computation and space costs for construction are linear in the number of elements.
+ The receiver uses the filter to test whether various elements are members of the set. Though the
+ filter will occasionally return a false positive, it will never return a false negative. When creating
+ the filter, the sender can choose its desired point in a trade-off between the false positive rate and the size.
+
+
+
+
+
+
+
+
+
+
+
+
+
+ this filter.
+ @param nbHash The number of hash function to consider.
+ @param hashType type of the hashing function (see
+ {@link org.apache.hadoop.util.hash.Hash}).]]>
+
+
+
+
+
+
+
+
+ this counting Bloom filter.
+
+ Invariant: nothing happens if the specified key does not belong to this counter Bloom filter.
+ @param key The key to remove.]]>
+
+
+
+
+
+
+
+
+
+
+
+ key -> count map.
+
NOTE: due to the bucket size of this filter, inserting the same
+ key more than 15 times will cause an overflow at all filter positions
+ associated with this key, and it will significantly increase the error
+ rate for this and other keys. For this reason the filter can only be
+ used to store small count values 0 <= N << 15.
+ @param key key to be tested
+ @return 0 if the key is not present. Otherwise, a positive value v will
+ be returned such that v == count with probability equal to the
+ error rate of this filter, and v > count otherwise.
+ Additionally, if the filter experienced an underflow as a result of
+ {@link #delete(Key)} operation, the return value may be lower than the
+ count with the probability of the false negative rate of such
+ filter.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ counting Bloom filter, as defined by Fan et al. in a ToN
+ 2000 paper.
+
+ A counting Bloom filter is an improvement to standard a Bloom filter as it
+ allows dynamic additions and deletions of set membership information. This
+ is achieved through the use of a counting vector instead of a bit vector.
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Builds an empty Dynamic Bloom filter.
+ @param vectorSize The number of bits in the vector.
+ @param nbHash The number of hash function to consider.
+ @param hashType type of the hashing function (see
+ {@link org.apache.hadoop.util.hash.Hash}).
+ @param nr The threshold for the maximum number of keys to record in a
+ dynamic Bloom filter row.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ dynamic Bloom filter, as defined in the INFOCOM 2006 paper.
+
+ A dynamic Bloom filter (DBF) makes use of a s * m bit matrix but
+ each of the s rows is a standard Bloom filter. The creation
+ process of a DBF is iterative. At the start, the DBF is a 1 * m
+ bit matrix, i.e., it is composed of a single standard Bloom filter.
+ It assumes that nr elements are recorded in the
+ initial bit vector, where nr <= n (n is
+ the cardinality of the set A to record in the filter).
+
+ As the size of A grows during the execution of the application,
+ several keys must be inserted in the DBF. When inserting a key into the DBF,
+ one must first get an active Bloom filter in the matrix. A Bloom filter is
+ active when the number of recorded keys, nr, is
+ strictly less than the current cardinality of A, n.
+ If an active Bloom filter is found, the key is inserted and
+ nr is incremented by one. On the other hand, if there
+ is no active Bloom filter, a new one is created (i.e., a new row is added to
+ the matrix) according to the current size of A and the element
+ is added in this new Bloom filter and the nr value of
+ this new Bloom filter is set to one. A given key is said to belong to the
+ DBF if the k positions are set to one in one of the matrix rows.
+
+
+
+
+
+
+
+
+
+
+ this filter.
+ @param nbHash The number of hash functions to consider.
+ @param hashType type of the hashing function (see {@link Hash}).]]>
+
+
+
+
+
+ this filter.
+ @param key The key to add.]]>
+
+
+
+
+
+ this filter.
+ @param key The key to test.
+ @return boolean True if the specified key belongs to this filter.
+ False otherwise.]]>
+
+
+
+
+
+ this filter and a specified filter.
+
+ Invariant: The result is assigned to this filter.
+ @param filter The filter to AND with.]]>
+
+
+
+
+
+ this filter and a specified filter.
+
+ Invariant: The result is assigned to this filter.
+ @param filter The filter to OR with.]]>
+
+
+
+
+
+ this filter and a specified filter.
+
+ Invariant: The result is assigned to this filter.
+ @param filter The filter to XOR with.]]>
+
+
+
+
+ this filter.
+
+ The result is assigned to this filter.]]>
+
+
+
+
+
+ this filter.
+ @param keys The list of keys.]]>
+
+
+
+
+
+ this filter.
+ @param keys The collection of keys.]]>
+
+
+
+
+
+ this filter.
+ @param keys The array of keys.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ this filter.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A filter is a data structure which aims at offering a lossy summary of a set A. The
+ key idea is to map entries of A (also called keys) into several positions
+ in a vector through the use of several hash functions.
+
+ Typically, a filter will be implemented as a Bloom filter (or a Bloom filter extension).
+
+ It must be extended in order to define the real behavior.
+
+ @see Key The general behavior of a key
+ @see HashFunction A hash function]]>
+
+
+
+
+
+
+
+
+ Builds a hash function that must obey to a given maximum number of returned values and a highest value.
+ @param maxValue The maximum highest returned value.
+ @param nbHash The number of resulting hashed values.
+ @param hashType type of the hashing function (see {@link Hash}).]]>
+
+
+
+
+ this hash function. A NOOP]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Builds a key with a default weight.
+ @param value The byte value of this key.]]>
+
+
+
+
+
+ Builds a key with a specified weight.
+ @param value The value of this key.
+ @param weight The weight associated to this key.]]>
+
+
+
+
+
+
+
+
+
+
+
+ this key.]]>
+
+
+
+
+ this key.]]>
+
+
+
+
+
+ this key with a specified value.
+ @param weight The increment.]]>
+
+
+
+
+ this key by one.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The idea is to randomly select a bit to reset.]]>
+
+
+
+
+
+ The idea is to select the bit to reset that will generate the minimum
+ number of false negative.]]>
+
+
+
+
+
+ The idea is to select the bit to reset that will remove the maximum number
+ of false positive.]]>
+
+
+
+
+
+ The idea is to select the bit to reset that will, at the same time, remove
+ the maximum number of false positve while minimizing the amount of false
+ negative generated.]]>
+
+
+
+
+ Originally created by
+ European Commission One-Lab Project 034819.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ this filter.
+ @param nbHash The number of hash function to consider.
+ @param hashType type of the hashing function (see
+ {@link org.apache.hadoop.util.hash.Hash}).]]>
+
+
+
+
+
+
+
+
+ this retouched Bloom filter.
+
+ Invariant: if the false positive is null, nothing happens.
+ @param key The false positive key to add.]]>
+
+
+
+
+
+ this retouched Bloom filter.
+ @param coll The collection of false positive.]]>
+
+
+
+
+
+ this retouched Bloom filter.
+ @param keys The list of false positive.]]>
+
+
+
+
+
+ this retouched Bloom filter.
+ @param keys The array of false positive.]]>
+
+
+
+
+
+
+ this retouched Bloom filter.
+ @param scheme The selective clearing scheme to apply.]]>
+
+
+
+
+
+
+
+
+
+
+
+ retouched Bloom filter, as defined in the CoNEXT 2006 paper.
+
+ It allows the removal of selected false positives at the cost of introducing
+ random false negatives, and with the benefit of eliminating some random false
+ positives at the same time.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ length, and
+ the provided seed value
+ @param bytes input bytes
+ @param length length of the valid bytes to consider
+ @param initval seed value
+ @return hash value]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The best hash table sizes are powers of 2. There is no need to do mod
+ a prime (mod is sooo slow!). If you need less than 32 bits, use a bitmask.
+ For example, if you need only 10 bits, do
+ h = (h & hashmask(10));
+ In which case, the hash table should have hashsize(10) elements.
+
+
If you are hashing n strings byte[][] k, do it like this:
+ for (int i = 0, h = 0; i < n; ++i) h = hash( k[i], h);
+
+
By Bob Jenkins, 2006. bob_jenkins@burtleburtle.net. You may use this
+ code any way you wish, private, educational, or commercial. It's free.
+
+
Use for hash table lookup, or anything where one collision in 2^^32 is
+ acceptable. Do NOT use for cryptographic purposes.]]>
+
+
+
+
+
+
+
+
+
+
+ lookup3.c, by Bob Jenkins, May 2006, Public Domain.
+
+ You can use this free for any purpose. It's in the public domain.
+ It has no warranty.
+
+
+ @see lookup3.c
+ @see Hash Functions (and how this
+ function compares to others such as CRC, MD?, etc
+ @see Has update on the
+ Dr. Dobbs Article]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The C version of MurmurHash 2.0 found at that site was ported
+ to Java by Andrzej Bialecki (ab at getopt org).]]>
+
+
+
+
+
+
diff --git a/x86_64/share/hadoop/common/jdiff/hadoop-core_0.21.0.xml b/x86_64/share/hadoop/common/jdiff/hadoop-core_0.21.0.xml
new file mode 100644
index 0000000..b88dfab
--- /dev/null
+++ b/x86_64/share/hadoop/common/jdiff/hadoop-core_0.21.0.xml
@@ -0,0 +1,25944 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ UnsupportedOperationException
+ @param key
+ @param newKeys
+ @param customMessage]]>
+
+
+
+
+
+
+ UnsupportedOperationException
+
+ @param key Key that is to be deprecated
+ @param newKeys list of keys that take up the values of deprecated key]]>
+
+
+
+
+
+
+
+
+
+
+
+ final.
+
+ @param name resource to be added, the classpath is examined for a file
+ with that name.]]>
+
+
+
+
+
+ final.
+
+ @param url url of the resource to be added, the local filesystem is
+ examined directly to find the resource, without referring to
+ the classpath.]]>
+
+
+
+
+
+ final.
+
+ @param file file-path of resource to be added, the local filesystem is
+ examined directly to find the resource, without referring to
+ the classpath.]]>
+
+
+
+
+
+ final.
+
+ @param in InputStream to deserialize the object from.]]>
+
+
+
+
+
+
+
+
+
+
+ name property, null if
+ no such property exists. If the key is deprecated, it returns the value of
+ the first key which replaces the deprecated key and is not null
+
+ Values are processed for variable expansion
+ before being returned.
+
+ @param name the property name.
+ @return the value of the name or its replacing property,
+ or null if no such property exists.]]>
+
+
+
+
+
+ name property, without doing
+ variable expansion.If the key is
+ deprecated, it returns the value of the first key which replaces
+ the deprecated key and is not null.
+
+ @param name the property name.
+ @return the value of the name property or
+ its replacing property and null if no such property exists.]]>
+
+
+
+
+
+
+ value of the name property. If
+ name is deprecated, it sets the value to the keys
+ that replace the deprecated key.
+
+ @param name property name.
+ @param value property value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ name. If the key is deprecated,
+ it returns the value of the first key which replaces the deprecated key
+ and is not null.
+ If no such property exists,
+ then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value, or defaultValue if the property
+ doesn't exist.]]>
+
+
+
+
+
+
+ name property as an int.
+
+ If no such property exists, or if the specified value is not a valid
+ int, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as an int,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to an int.
+
+ @param name property name.
+ @param value int value of the property.]]>
+
+
+
+
+
+
+ name property as a long.
+ If no such property is specified, or if the specified value is not a valid
+ long, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as a long,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to a long.
+
+ @param name property name.
+ @param value long value of the property.]]>
+
+
+
+
+
+
+ name property as a float.
+ If no such property is specified, or if the specified value is not a valid
+ float, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as a float,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to a float.
+
+ @param name property name.
+ @param value property value.]]>
+
+
+
+
+
+
+ name property as a boolean.
+ If no such property is specified, or if the specified value is not a valid
+ boolean, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as a boolean,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to a boolean.
+
+ @param name property name.
+ @param value boolean value of the property.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ name property to the given type. This
+ is equivalent to set(<name>, value.toString()).
+ @param name property name
+ @param value new value]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ name property as a Pattern.
+ If no such property is specified, or if the specified value is not a valid
+ Pattern, then DefaultValue is returned.
+
+ @param name property name
+ @param defaultValue default value
+ @return property value as a compiled Pattern, or defaultValue]]>
+
+
+
+
+
+
+ Pattern.
+ If the pattern is passed as null, sets the empty pattern which results in
+ further calls to getPattern(...) returning the default value.
+
+ @param name property name
+ @param pattern new value]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ name property as
+ a collection of Strings.
+ If no such property is specified then empty collection is returned.
+
+ This is an optimized version of {@link #getStrings(String)}
+
+ @param name property name.
+ @return property value as a collection of Strings.]]>
+
+
+
+
+
+ name property as
+ an array of Strings.
+ If no such property is specified then null is returned.
+
+ @param name property name.
+ @return property value as an array of Strings,
+ or null.]]>
+
+
+
+
+
+
+ name property as
+ an array of Strings.
+ If no such property is specified then default value is returned.
+
+ @param name property name.
+ @param defaultValue The default value
+ @return property value as an array of Strings,
+ or default value.]]>
+
+
+
+
+
+ name property as
+ a collection of Strings, trimmed of the leading and trailing whitespace.
+ If no such property is specified then empty Collection is returned.
+
+ @param name property name.
+ @return property value as a collection of Strings, or empty Collection]]>
+
+
+
+
+
+ name property as
+ an array of Strings, trimmed of the leading and trailing whitespace.
+ If no such property is specified then an empty array is returned.
+
+ @param name property name.
+ @return property value as an array of trimmed Strings,
+ or empty array.]]>
+
+
+
+
+
+
+ name property as
+ an array of Strings, trimmed of the leading and trailing whitespace.
+ If no such property is specified then default value is returned.
+
+ @param name property name.
+ @param defaultValue The default value
+ @return property value as an array of trimmed Strings,
+ or default value.]]>
+
+
+
+
+
+
+ name property as
+ as comma delimited values.
+
+ @param name property name.
+ @param values The values]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ name property
+ as an array of Class.
+ The value of the property specifies a list of comma separated class names.
+ If no such property is specified, then defaultValue is
+ returned.
+
+ @param name the property name.
+ @param defaultValue default value.
+ @return property value as a Class[],
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property as a Class.
+ If no such property is specified, then defaultValue is
+ returned.
+
+ @param name the class name.
+ @param defaultValue default value.
+ @return property value as a Class,
+ or defaultValue.]]>
+
+
+
+
+
+
+
+ name property as a Class
+ implementing the interface specified by xface.
+
+ If no such property is specified, then defaultValue is
+ returned.
+
+ An exception is thrown if the returned class does not implement the named
+ interface.
+
+ @param name the class name.
+ @param defaultValue default value.
+ @param xface the interface implemented by the named class.
+ @return property value as a Class,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property as a List
+ of objects implementing the interface specified by xface.
+
+ An exception is thrown if any of the classes does not exist, or if it does
+ not implement the named interface.
+
+ @param name the property name.
+ @param xface the interface implemented by the classes named by
+ name.
+ @return a List of objects implementing xface.]]>
+
+
+
+
+
+
+
+ name property to the name of a
+ theClass implementing the given interface xface.
+
+ An exception is thrown if theClass does not implement the
+ interface xface.
+
+ @param name property name.
+ @param theClass property value.
+ @param xface the interface implemented by the named class.]]>
+
+
+
+
+
+
+
+ dirsProp with
+ the given path. If dirsProp contains multiple directories,
+ then one is chosen based on path's hash code. If the selected
+ directory does not exist, an attempt is made to create it.
+
+ @param dirsProp directory in which to locate the file.
+ @param path file-path.
+ @return local file under the directory with the given path.]]>
+
+
+
+
+
+
+
+ dirsProp with
+ the given path. If dirsProp contains multiple directories,
+ then one is chosen based on path's hash code. If the selected
+ directory does not exist, an attempt is made to create it.
+
+ @param dirsProp directory in which to locate the file.
+ @param path file-path.
+ @return local file under the directory with the given path.]]>
+
+
+
+
+
+
+
+
+
+
+
+ name.
+
+ @param name configuration resource name.
+ @return an input stream attached to the resource.]]>
+
+
+
+
+
+ name.
+
+ @param name configuration resource name.
+ @return a reader attached to the resource.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ String
+ key-value pairs in the configuration.
+
+ @return an iterator over the entries.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true to set quiet-mode on, false
+ to turn it off.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Resources
+
+
Configurations are specified by resources. A resource contains a set of
+ name/value pairs as XML data. Each resource is named by either a
+ String or by a {@link Path}. If named by a String,
+ then the classpath is examined for a file with that name. If named by a
+ Path, then the local filesystem is examined directly, without
+ referring to the classpath.
+
+
Unless explicitly turned off, Hadoop by default specifies two
+ resources, loaded in-order from the classpath:
core-site.xml: Site-specific configuration for a given hadoop
+ installation.
+
+ Applications may add additional resources, which are loaded
+ subsequent to these resources in the order they are added.
+
+
Final Parameters
+
+
Configuration parameters may be declared final.
+ Once a resource declares a value final, no subsequently-loaded
+ resource can alter that value.
+ For example, one might define a final parameter with:
+
+
+ When conf.get("tempdir") is called, then ${basedir}
+ will be resolved to another property in this Configuration, while
+ ${user.name} would then ordinarily be resolved to the value
+ of the System property with that name.]]>
+
+ Combine {@link #OVERWRITE} with either {@link #CREATE}
+ or {@link #APPEND} does the same as only use
+ {@link #OVERWRITE}.
+ Combine {@link #CREATE} with {@link #APPEND} has the semantic:
+
Progress - to report progress on the operation - default null
+
Permission - umask is applied against permisssion: default is
+ FsPermissions:getDefault()
+
+
CreateParent - create missing parent path; default is to not
+ to create parents
+
The defaults for the following are SS defaults of the file
+ server implementing the target path. Not all parameters make sense
+ for all kinds of file system - eg. localFS ignores Blocksize,
+ replication, checksum
+
+
BufferSize - buffersize used in FSDataOutputStream
+
Blocksize - block size for file blocks
+
ReplicationFactor - replication for blocks
+
BytesPerChecksum - bytes per checksum
+
+
+
+ @return {@link FSDataOutputStream} for created file
+
+ @throws AccessControlException If access is denied
+ @throws FileAlreadyExistsException If file f already exists
+ @throws FileNotFoundException If parent of f does not exist
+ and createParent is false
+ @throws ParentNotDirectoryException If parent of f is not a
+ directory.
+ @throws UnsupportedFileSystemException If file system for f is
+ not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server
+
+ RuntimeExceptions:
+ @throws InvalidPathException If path f is not valid]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ dir already
+ exists
+ @throws FileNotFoundException If parent of dir does not exist
+ and createParent is false
+ @throws ParentNotDirectoryException If parent of dir is not a
+ directory
+ @throws UnsupportedFileSystemException If file system for dir
+ is not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server
+
+ RuntimeExceptions:
+ @throws InvalidPathException If path dir is not valid]]>
+
+
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f is
+ not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server
+
+ RuntimeExceptions:
+ @throws InvalidPathException If path f is invalid]]>
+
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f
+ is not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f is
+ not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Fails if src is a file and dst is a directory.
+
Fails if src is a directory and dst is a file.
+
Fails if the parent of dst does not exist or is a file.
+
+
+ If OVERWRITE option is not passed as an argument, rename fails if the dst
+ already exists.
+
+ If OVERWRITE option is passed as an argument, rename overwrites the dst if
+ it is a file or an empty directory. Rename fails if dst is a non-empty
+ directory.
+
+ Note that atomicity of rename is dependent on the file system
+ implementation. Please refer to the file system documentation for details
+
+
+ @param src path to be renamed
+ @param dst new path after rename
+
+ @throws AccessControlException If access is denied
+ @throws FileAlreadyExistsException If dst already exists and
+ options has {@link Rename#OVERWRITE} option
+ false.
+ @throws FileNotFoundException If src does not exist
+ @throws ParentNotDirectoryException If parent of dst is not a
+ directory
+ @throws UnsupportedFileSystemException If file system for src
+ and dst is not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f
+ is not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f is
+ not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server
+
+ RuntimeExceptions:
+ @throws HadoopIllegalArgumentException If username or
+ groupname is invalid.]]>
+
+
+
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f is
+ not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f is
+ not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f is
+ not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f is
+ not supported
+ @throws IOException If an I/O error occurred]]>
+
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f is
+ not supported
+ @throws IOException If an I/O error occurred]]>
+
+
+
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f is
+ not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server
+
+ RuntimeExceptions:
+ @throws InvalidPathException If path f is invalid]]>
+
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f is
+ not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Given a path referring to a symlink of form:
+
+ <---X--->
+ fs://host/A/B/link
+ <-----Y----->
+
+ In this path X is the scheme and authority that identify the file system,
+ and Y is the path leading up to the final path component "link". If Y is
+ a symlink itself then let Y' be the target of Y and X' be the scheme and
+ authority of Y'. Symlink targets may:
+
+ 1. Fully qualified URIs
+
+ fs://hostX/A/B/file Resolved according to the target file system.
+
+ 2. Partially qualified URIs (eg scheme but no host)
+
+ fs:///A/B/file Resolved according to the target file sytem. Eg resolving
+ a symlink to hdfs:///A results in an exception because
+ HDFS URIs must be fully qualified, while a symlink to
+ file:///A will not since Hadoop's local file systems
+ require partially qualified URIs.
+
+ 3. Relative paths
+
+ path Resolves to [Y'][path]. Eg if Y resolves to hdfs://host/A and path
+ is "../B/file" then [Y'][path] is hdfs://host/B/file
+
+ 4. Absolute paths
+
+ path Resolves to [X'][path]. Eg if Y resolves hdfs://host/A/B and path
+ is "/file" then [X][path] is hdfs://host/file
+
+
+ @param target the target of the symbolic link
+ @param link the path to be created that points to target
+ @param createParent if true then missing parent dirs are created if
+ false then parent must exist
+
+
+ @throws AccessControlException If access is denied
+ @throws FileAlreadyExistsException If file linkcode> already exists
+ @throws FileNotFoundException If target does not exist
+ @throws ParentNotDirectoryException If parent of link is not a
+ directory.
+ @throws UnsupportedFileSystemException If file system for
+ target or link is not supported
+ @throws IOException If an I/O error occurred]]>
+
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f is
+ not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+ f is
+ not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ *** Path Names ***
+
+
+ The Hadoop file system supports a URI name space and URI names.
+ It offers a forest of file systems that can be referenced using fully
+ qualified URIs.
+ Two common Hadoop file systems implementations are
+
+
the local file system: file:///path
+
the hdfs file system hdfs://nnAddress:nnPort/path
+
+
+ While URI names are very flexible, it requires knowing the name or address
+ of the server. For convenience one often wants to access the default system
+ in one's environment without knowing its name/address. This has an
+ additional benefit that it allows one to change one's default fs
+ (e.g. admin moves application from cluster1 to cluster2).
+
+
+ To facilitate this, Hadoop supports a notion of a default file system.
+ The user can set his default file system, although this is
+ typically set up for you in your environment via your default config.
+ A default file system implies a default scheme and authority; slash-relative
+ names (such as /for/bar) are resolved relative to that default FS.
+ Similarly a user can also have working-directory-relative names (i.e. names
+ not starting with a slash). While the working directory is generally in the
+ same default FS, the wd can be in a different FS.
+
+ Hence Hadoop path names can be one of:
+
+
fully qualified URI: scheme://authority/path
+
slash relative names: /path relative to the default file system
+
wd-relative names: path relative to the working dir
+
+ Relative paths with scheme (scheme:foo/bar) are illegal.
+
+
+ ****The Role of the FileContext and configuration defaults****
+
+ The FileContext provides file namespace context for resolving file names;
+ it also contains the umask for permissions, In that sense it is like the
+ per-process file-related state in Unix system.
+ These two properties
+
+
default file system i.e your slash)
+
umask
+
+ in general, are obtained from the default configuration file
+ in your environment, (@see {@link Configuration}).
+
+ No other configuration parameters are obtained from the default config as
+ far as the file context layer is concerned. All file system instances
+ (i.e. deployments of file systems) have default properties; we call these
+ server side (SS) defaults. Operation like create allow one to select many
+ properties: either pass them in as explicit parameters or use
+ the SS properties.
+
+ The file system related SS defaults are
+
+
the home directory (default is "/user/userName")
+
the initial wd (only for local fs)
+
replication factor
+
block size
+
buffer size
+
bytesPerChecksum (if used).
+
+
+
+ *** Usage Model for the FileContext class ***
+
+ Example 1: use the default config read from the $HADOOP_CONFIG/core.xml.
+ Unspecified values come from core-defaults.xml in the release jar.
+
+
myFContext = FileContext.getFileContext(); // uses the default config
+ // which has your default FS
+
myFContext.create(path, ...);
+
myFContext.setWorkingDir(path)
+
myFContext.open (path, ...);
+
+ Example 2: Get a FileContext with a specific URI as the default FS
+
+
myFContext = FileContext.getFileContext(URI)
+
myFContext.create(path, ...);
+ ...
+
+ Example 3: FileContext with local file system as the default
+
+ Example 4: Use a specific config, ignoring $HADOOP_CONFIG
+ Generally you should not need use a config unless you are doing
+
+
configX = someConfigSomeOnePassedToYou.
+
myFContext = getFileContext(configX); // configX is not changed,
+ // is passed down
+
myFContext.create(path, ...);
+
...
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ path could
+ not be resolved
+ @throws IOException an I/O error occured]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ f is
+ not supported
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for
+ f is not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for
+ pathPattern is not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+
+ files does not
+ exist
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f is
+ not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+ Return all the files that match filePattern and are not checksum
+ files. Results are sorted by their names.
+
+
+ A filename pattern is composed of regular characters and
+ special pattern matching characters, which are:
+
+
+
+
+
+
?
+
Matches any single character.
+
+
+
*
+
Matches zero or more characters.
+
+
+
[abc]
+
Matches a single character from character set
+ {a,b,c}.
+
+
+
[a-b]
+
Matches a single character from the character range
+ {a...b}. Note: character a must be
+ lexicographically less than or equal to character b.
+
+
+
[^a]
+
Matches a single char that is not from character set or range
+ {a}. Note that the ^ character must occur
+ immediately to the right of the opening bracket.
+
+
+
\c
+
Removes (escapes) any special meaning of character c.
+
+
+
{ab,cd}
+
Matches a string from the string set {ab, cd}
+
+
+
{ab,c{de,fh}}
+
Matches a string from string set {ab, cde, cfh}
+
+
+
+
+
+ @param pathPattern a regular expression specifying a pth pattern
+
+ @return an array of paths that match the path pattern
+
+ @throws AccessControlException If access is denied
+ @throws UnsupportedFileSystemException If file system for
+ pathPattern is not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+
+ pathPattern is not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ dst already exists
+ @throws FileNotFoundException If src does not exist
+ @throws ParentNotDirectoryException If parent of dst is not
+ a directory
+ @throws UnsupportedFileSystemException If file system for
+ src or dst is not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server
+
+ RuntimeExceptions:
+ @throws InvalidPathException If path dst is invalid]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ fs.scheme.class whose value names the FileSystem class.
+ The entire URI is passed to the FileSystem instance's initialize method.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ fs.scheme.class whose value names the FileSystem class.
+ The entire URI is passed to the FileSystem instance's initialize method.
+ This always returns a new FileSystem object.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Fails if src is a file and dst is a directory.
+
Fails if src is a directory and dst is a file.
+
Fails if the parent of dst does not exist or is a file.
+
+
+ If OVERWRITE option is not passed as an argument, rename fails
+ if the dst already exists.
+
+ If OVERWRITE option is passed as an argument, rename overwrites
+ the dst if it is a file or an empty directory. Rename fails if dst is
+ a non-empty directory.
+
+ Note that atomicity of rename is dependent on the file system
+ implementation. Please refer to the file system documentation for
+ details. This default implementation is non atomic.
+
+ This method is deprecated since it is a temporary method added to
+ support the transition from FileSystem to FileContext for user
+ applications.
+
+ @param src path to be renamed
+ @param dst new path after rename
+ @throws IOException on failure]]>
+
+ A filename pattern is composed of regular characters and
+ special pattern matching characters, which are:
+
+
+
+
+
+
?
+
Matches any single character.
+
+
+
*
+
Matches zero or more characters.
+
+
+
[abc]
+
Matches a single character from character set
+ {a,b,c}.
+
+
+
[a-b]
+
Matches a single character from the character range
+ {a...b}. Note that character a must be
+ lexicographically less than or equal to character b.
+
+
+
[^a]
+
Matches a single character that is not from character set or range
+ {a}. Note that the ^ character must occur
+ immediately to the right of the opening bracket.
+
+
+
\c
+
Removes (escapes) any special meaning of character c.
+
+
+
{ab,cd}
+
Matches a string from the string set {ab, cd}
+
+
+
{ab,c{de,fh}}
+
Matches a string from the string set {ab, cde, cfh}
+
+
+
+This pages describes how to use Kosmos Filesystem
+( KFS ) as a backing
+store with Hadoop. This page assumes that you have downloaded the
+KFS software and installed necessary binaries as outlined in the KFS
+documentation.
+
+
Steps
+
+
+
In the Hadoop conf directory edit core-site.xml,
+ add the following:
+
In the Hadoop conf directory edit core-site.xml,
+ adding the following (with appropriate values for
+ <server> and <port>):
+
+<property>
+ <name>fs.default.name</name>
+ <value>kfs://<server:port></value>
+</property>
+
+<property>
+ <name>fs.kfs.metaServerHost</name>
+ <value><server></value>
+ <description>The location of the KFS meta server.</description>
+</property>
+
+<property>
+ <name>fs.kfs.metaServerPort</name>
+ <value><port></value>
+ <description>The location of the meta server's port.</description>
+</property>
+
+
+
+
+
Copy KFS's kfs-0.1.jar to Hadoop's lib directory. This step
+ enables Hadoop's to load the KFS specific modules. Note
+ that, kfs-0.1.jar was built when you compiled KFS source
+ code. This jar file contains code that calls KFS's client
+ library code via JNI; the native code is in KFS's
+ libkfsClient.so library.
+
+
+
When the Hadoop map/reduce trackers start up, those
+processes (on local as well as remote nodes) will now need to load
+KFS's libkfsClient.so library. To simplify this process, it is advisable to
+store libkfsClient.so in an NFS accessible directory (similar to where
+Hadoop binaries/scripts are stored); then, modify Hadoop's
+conf/hadoop-env.sh adding the following line and providing suitable
+value for <path>:
+
+export LD_LIBRARY_PATH=<path>
+
+
+
+
Start only the map/reduce trackers
+
+ example: execute Hadoop's bin/start-mapred.sh
+
+
+
+If the map/reduce job trackers start up, all file-I/O is done to KFS.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ (cause==null ? null : cause.toString()) (which
+ typically contains the class and detail message of cause).
+ @param cause the cause (which is saved for later retrieval by the
+ {@link #getCause()} method). (A null value is
+ permitted, and indicates that the cause is nonexistent or
+ unknown.)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ mode is invalid]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class is a tool for migrating data from an older to a newer version
+ of an S3 filesystem.
+
+
+ All files in the filesystem are migrated by re-writing the block metadata
+ - no datafiles are touched.
+
+Files are stored in S3 as blocks (represented by
+{@link org.apache.hadoop.fs.s3.Block}), which have an ID and a length.
+Block metadata is stored in S3 as a small record (represented by
+{@link org.apache.hadoop.fs.s3.INode}) using the URL-encoded
+path string as a key. Inodes record the file type (regular file or directory) and the list of blocks.
+This design makes it easy to seek to any given position in a file by reading the inode data to compute
+which block to access, then using S3's support for
+HTTP Range headers
+to start streaming from the correct position.
+Renames are also efficient since only the inode is moved (by a DELETE followed by a PUT since
+S3 does not support renames).
+
+
+For a single file /dir1/file1 which takes two blocks of storage, the file structure in S3
+would be something like this:
+
+Inodes start with a leading /, while blocks are prefixed with block-.
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ If f is a file, this method will make a single call to S3.
+ If f is a directory, this method will make a maximum of
+ (n / 1000) + 2 calls to S3, where n is the total number of
+ files and directories contained directly in f.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A {@link FileSystem} for reading and writing files stored on
+ Amazon S3.
+ Unlike {@link org.apache.hadoop.fs.s3.S3FileSystem} this implementation
+ stores files on S3 in their
+ native form so they can be read by other S3 tools.
+
+ A note about directories. S3 of course has no "native" support for them.
+ The idiom we choose then is: for any directory created by this class,
+ we use an empty object "#{dirpath}_$folder$" as a marker.
+ Further, to interoperate with other S3 tools, we also accept the following:
+ - an object "#{dirpath}/' denoting a directory marker
+ - if there exists any objects with the prefix "#{dirpath}/", then the
+ directory is said to exist
+ - if both a file with the name of a directory and a marker for that
+ directory exists, then the *file masks the directory*, and the directory
+ is never returned.
+
+ @see org.apache.hadoop.fs.s3.S3FileSystem]]>
+
+
+
+
+
+A distributed implementation of {@link
+org.apache.hadoop.fs.FileSystem} for reading and writing files on
+Amazon S3.
+Unlike {@link org.apache.hadoop.fs.s3.S3FileSystem}, which is block-based,
+this implementation stores
+files on S3 in their native form for interoperability with other S3 tools.
+]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ nth value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ nth value in the file.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ public class IntArrayWritable extends ArrayWritable {
+ public IntArrayWritable() {
+ super(IntWritable.class);
+ }
+ }
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a ByteWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to store
+ @param item the object to be stored
+ @param keyName the name of the key to use
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param keyName the name of the key to use
+ @param itemClass the class of the item
+ @return restored object
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param items the objects to be stored
+ @param keyName the name of the key to use
+ @throws IndexOutOfBoundsException if the items array is empty
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param keyName the name of the key to use
+ @param itemClass the class of the item
+ @return restored object
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+ DefaultStringifier offers convenience methods to store/load objects to/from
+ the configuration.
+
+ @param the class of the objects to stringify]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a DoubleWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ value argument is null or
+ its size is zero, the elementType argument must not be null. If
+ the argument value's size is bigger than zero, the argument
+ elementType is not be used.
+
+ @param value
+ @param elementType]]>
+
+
+
+
+ value should not be null
+ or empty.
+
+ @param value]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ value and elementType. If the value argument
+ is null or its size is zero, the elementType argument must not be
+ null. If the argument value's size is bigger than zero, the
+ argument elementType is not be used.
+
+ @param value
+ @param elementType]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is an EnumSetWritable with the same value,
+ or both are null.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a FloatWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ When two sequence files, which have same Key type but different Value
+ types, are mapped out to reduce, multiple Value types is not allowed.
+ In this case, this class can help you wrap instances with different types.
+
+
+
+ Compared with ObjectWritable, this class is much more effective,
+ because ObjectWritable will append the class declaration as a String
+ into the output file in every Key-Value pair.
+
+
+
+ Generic Writable implements {@link Configurable} interface, so that it will be
+ configured by the framework. The configuration is passed to the wrapped objects
+ implementing {@link Configurable} interface before deserialization.
+
+
+ how to use it:
+ 1. Write your own class, such as GenericObject, which extends GenericWritable.
+ 2. Implements the abstract method getTypes(), defines
+ the classes which will be wrapped in GenericObject in application.
+ Attention: this classes defined in getTypes() method, must
+ implement Writable interface.
+
+
+ @since Nov 8, 2006]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a IntWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ closes the input and output streams
+ at the end.
+ @param in InputStrem to read from
+ @param out OutputStream to write to
+ @param conf the Configuration object]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ignore any {@link IOException} or
+ null pointers. Must only be used for cleanup in exception handlers.
+ @param log the log to record problems to at debug level. Can be null.
+ @param closeables the objects to close]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a LongWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A map is a directory containing two files, the data file,
+ containing all keys and values in the map, and a smaller index
+ file, containing a fraction of the keys. The fraction is determined by
+ {@link Writer#getIndexInterval()}.
+
+
The index file is read entirely into memory. Thus key implementations
+ should try to keep themselves small.
+
+
Map files are created by adding entries in-order. To maintain a large
+ database, perform updates by copying the previous version of a database and
+ merging in a sorted change list, to create a new version of the database in
+ a new file. Sorting large change lists can be done with {@link
+ SequenceFile.Sorter}.]]>
+
SequenceFile provides {@link Writer}, {@link Reader} and
+ {@link Sorter} classes for writing, reading and sorting respectively.
+
+ There are three SequenceFileWriters based on the
+ {@link CompressionType} used to compress key/value pairs:
+
+
+ Writer : Uncompressed records.
+
+
+ RecordCompressWriter : Record-compressed files, only compress
+ values.
+
+
+ BlockCompressWriter : Block-compressed files, both keys &
+ values are collected in 'blocks'
+ separately and compressed. The size of
+ the 'block' is configurable.
+
+
+
The actual compression algorithm used to compress key and/or values can be
+ specified by using the appropriate {@link CompressionCodec}.
+
+
The recommended way is to use the static createWriter methods
+ provided by the SequenceFile to chose the preferred format.
+
+
The {@link Reader} acts as the bridge and can read any of the above
+ SequenceFile formats.
+
+
SequenceFile Formats
+
+
Essentially there are 3 different formats for SequenceFiles
+ depending on the CompressionType specified. All of them share a
+ common header described below.
+
+
SequenceFile Header
+
+
+ version - 3 bytes of magic header SEQ, followed by 1 byte of actual
+ version number (e.g. SEQ4 or SEQ6)
+
+
+ keyClassName -key class
+
+
+ valueClassName - value class
+
+
+ compression - A boolean which specifies if compression is turned on for
+ keys/values in this file.
+
+
+ blockCompression - A boolean which specifies if block-compression is
+ turned on for keys/values in this file.
+
+
+ compression codec - CompressionCodec class which is used for
+ compression of keys and/or values (if compression is
+ enabled).
+
+
+ metadata - {@link Metadata} for this file.
+
+
+ sync - A sync marker to denote end of the header.
+
The compressed blocks of key lengths and value lengths consist of the
+ actual lengths of individual keys/values encoded in ZeroCompressedInteger
+ format.
+
+ @see CompressionCodec]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ = 0. Otherwise,
+ the length is not available.
+ @return The opened stream.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ key, skipping its
+ value. True if another entry exists, and false at end of file.]]>
+
+
+
+
+
+
+
+ key and
+ val. Returns true if such a pair exists and false when at
+ end of file]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The position passed must be a position returned by {@link
+ SequenceFile.Writer#getLength()} when writing this file. To seek to an arbitrary
+ position, use {@link SequenceFile.Reader#sync(long)}.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ SegmentDescriptor
+ @param segments the list of SegmentDescriptors
+ @param tmpDir the directory to write temporary files into
+ @return RawKeyValueIterator
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ For best performance, applications should make sure that the {@link
+ Writable#readFields(DataInput)} implementation of their keys is
+ very efficient. In particular, it should avoid allocating memory.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This always returns a synchronized position. In other words,
+ immediately after calling {@link SequenceFile.Reader#seek(long)} with a position
+ returned by this method, {@link SequenceFile.Reader#next(Writable)} may be called. However
+ the key may be earlier in the file than key last written when this
+ method was called (e.g., with block-compression, it may be the first key
+ in the block that was being written when this method was called).]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ key. Returns
+ true if such a key exists and false when at the end of the set.]]>
+
+
+
+
+
+
+ key.
+ Returns key, or null if no match exists.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the class of the objects to stringify]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ position. Note that this
+ method avoids using the converter or doing String instatiation
+ @return the Unicode scalar value at position or -1
+ if the position is invalid or points to a
+ trailing byte]]>
+
+
+
+
+
+
+
+
+
+ what in the backing
+ buffer, starting as position start. The starting
+ position is measured in bytes and the return value is in
+ terms of byte position in the buffer. The backing buffer is
+ not converted to a string for this operation.
+ @return byte position of the first occurence of the search
+ string in the UTF-8 buffer or -1 if not found]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a Text with the same contents.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ replace is true, then
+ malformed input is replaced with the
+ substitution character, which is U+FFFD. Otherwise the
+ method throws a MalformedInputException.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ replace is true, then
+ malformed input is replaced with the
+ substitution character, which is U+FFFD. Otherwise the
+ method throws a MalformedInputException.
+ @return ByteBuffer: bytes stores at ByteBuffer.array()
+ and length is ByteBuffer.limit()]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ In
+ addition, it provides methods for string traversal without converting the
+ byte array to a string.
Also includes utilities for
+ serializing/deserialing a string, coding/decoding a string, checking if a
+ byte array contains valid UTF8 code, calculating the length of an encoded
+ string.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This is useful when a class may evolve, so that instances written by the
+ old version of the class may still be processed by the new version. To
+ handle this situation, {@link #readFields(DataInput)}
+ implementations should catch {@link VersionMismatchException}.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a VIntWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a VLongWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ out.
+
+ @param out DataOuput to serialize this object into.
+ @throws IOException]]>
+
+
+
+
+
+
+ in.
+
+
For efficiency, implementations should attempt to re-use storage in the
+ existing object where possible.
+
+ @param in DataInput to deseriablize this object from.
+ @throws IOException]]>
+
+
+
+ Any key or value type in the Hadoop Map-Reduce
+ framework implements this interface.
+
+
Implementations typically implement a static read(DataInput)
+ method which constructs a new instance, calls {@link #readFields(DataInput)}
+ and returns the instance.
+
+
Example:
+
+ public class MyWritable implements Writable {
+ // Some data
+ private int counter;
+ private long timestamp;
+
+ public void write(DataOutput out) throws IOException {
+ out.writeInt(counter);
+ out.writeLong(timestamp);
+ }
+
+ public void readFields(DataInput in) throws IOException {
+ counter = in.readInt();
+ timestamp = in.readLong();
+ }
+
+ public static MyWritable read(DataInput in) throws IOException {
+ MyWritable w = new MyWritable();
+ w.readFields(in);
+ return w;
+ }
+ }
+
]]>
+
+
+
+
+
+
+
+
+ WritableComparables can be compared to each other, typically
+ via Comparators. Any type which is to be used as a
+ key in the Hadoop Map-Reduce framework should implement this
+ interface.
+
+
Example:
+
+ public class MyWritableComparable implements
+ WritableComparable<MyWritableComparable> {
+
+ // Some data
+ private int counter;
+ private long timestamp;
+
+ public void write(DataOutput out) throws IOException {
+ out.writeInt(counter);
+ out.writeLong(timestamp);
+ }
+
+ public void readFields(DataInput in) throws IOException {
+ counter = in.readInt();
+ timestamp = in.readLong();
+ }
+
+ public int compareTo(MyWritableComparable other) {
+ int thisValue = this.counter;
+ int thatValue = other.counter;
+ return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
+ }
+ }
+
One may optimize compare-intensive operations by overriding
+ {@link #compare(byte[],int,int,byte[],int,int)}. Static utility methods are
+ provided to assist in optimized implementations of this method.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Enum type
+ @param in DataInput to read from
+ @param enumType Class type of Enum
+ @return Enum represented by String read from DataInput
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ len number of bytes in input streamin
+ @param in input stream
+ @param len number of bytes to skip
+ @throws IOException when skipped less number of bytes]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ CompressionCodec for which to get the
+ Compressor
+ @param conf the Configuration object which contains confs for creating or reinit the compressor
+ @return Compressor for the given
+ CompressionCodec from the pool or a new one]]>
+
+
+
+
+
+
+
+
+ CompressionCodec for which to get the
+ Decompressor
+ @return Decompressor for the given
+ CompressionCodec the pool or a new one]]>
+
+
+
+
+
+ Compressor to be returned to the pool]]>
+
+
+
+
+
+ Decompressor to be returned to the
+ pool]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Implementations are assumed to be buffered. This permits clients to
+ reposition the underlying input stream then call {@link #resetState()},
+ without having to also synchronize client buffers.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true indicating that more input data is required.
+
+ @param b Input data
+ @param off Start offset
+ @param len Length]]>
+
+
+
+
+ true if the input data buffer is empty and
+ #setInput() should be called in order to provide more input.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the end of the compressed
+ data output stream has been reached.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true indicating that more input data is required.
+
+ @param b Input data
+ @param off Start offset
+ @param len Length]]>
+
+
+
+
+ true if the input data buffer is empty and
+ #setInput() should be called in order to provide more input.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ true if a preset dictionary is needed for decompression.
+ @return true if a preset dictionary is needed for decompression]]>
+
+
+
+
+ true if the end of the compressed
+ data output stream has been reached.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Seek by key or by file offset.
+
+ The memory footprint of a TFile includes the following:
+
+
Some constant overhead of reading or writing a compressed block.
+
+
Each compressed block requires one compression/decompression codec for
+ I/O.
+
Temporary space to buffer the key.
+
Temporary space to buffer the value (for TFile.Writer only). Values are
+ chunk encoded, so that we buffer at most one chunk of user data. By default,
+ the chunk buffer is 1MB. Reading chunked value does not require additional
+ memory.
+
+
TFile index, which is proportional to the total number of Data Blocks.
+ The total amount of memory needed to hold the index can be estimated as
+ (56+AvgKeySize)*NumBlocks.
+
MetaBlock index, which is proportional to the total number of Meta
+ Blocks.The total amount of memory needed to hold the index for Meta Blocks
+ can be estimated as (40+AvgMetaBlockName)*NumMetaBlock.
+
+
+ The behavior of TFile can be customized by the following variables through
+ Configuration:
+
+
tfile.io.chunk.size: Value chunk size. Integer (in bytes). Default
+ to 1MB. Values of the length less than the chunk size is guaranteed to have
+ known value length in read time (See
+ {@link TFile.Reader.Scanner.Entry#isValueLengthKnown()}).
+
tfile.fs.output.buffer.size: Buffer size used for
+ FSDataOutputStream. Integer (in bytes). Default to 256KB.
+
tfile.fs.input.buffer.size: Buffer size used for
+ FSDataInputStream. Integer (in bytes). Default to 256KB.
+
+
+ Suggestions on performance optimization.
+
+
Minimum block size. We recommend a setting of minimum block size between
+ 256KB to 1MB for general usage. Larger block size is preferred if files are
+ primarily for sequential access. However, it would lead to inefficient random
+ access (because there are more data to decompress). Smaller blocks are good
+ for random access, but require more memory to hold the block index, and may
+ be slower to create (because we must flush the compressor stream at the
+ conclusion of each data block, which leads to an FS I/O flush). Further, due
+ to the internal caching in Compression codec, the smallest possible block
+ size would be around 20KB-30KB.
+
The current implementation does not offer true multi-threading for
+ reading. The implementation uses FSDataInputStream seek()+read(), which is
+ shown to be much faster than positioned-read call in single thread mode.
+ However, it also means that if multiple threads attempt to access the same
+ TFile (using multiple scanners) simultaneously, the actual I/O is carried out
+ sequentially even if they access different DFS blocks.
+
Compression codec. Use "none" if the data is not very compressable (by
+ compressable, I mean a compression ratio at least 2:1). Generally, use "lzo"
+ as the starting point for experimenting. "gz" overs slightly better
+ compression ratio over "lzo" but requires 4x CPU to compress and 2x CPU to
+ decompress, comparing to "lzo".
+
File system buffering, if the underlying FSDataInputStream and
+ FSDataOutputStream is already adequately buffered; or if applications
+ reads/writes keys and values in large buffers, we can reduce the sizes of
+ input/output buffering in TFile layer by setting the configuration parameters
+ "tfile.fs.input.buffer.size" and "tfile.fs.output.buffer.size".
+
+
+ Some design rationale behind TFile can be found at Hadoop-3315.]]>
+
+ Use {@link Scanner#advance()} to move the cursor to the next key-value
+ pair (or end if none exists). Use seekTo methods (
+ {@link Scanner#seekTo(byte[])} or
+ {@link Scanner#seekTo(byte[], int, int)}) to seek to any arbitrary
+ location in the covered range (including backward seeking). Use
+ {@link Scanner#rewind()} to seek back to the beginning of the scanner.
+ Use {@link Scanner#seekToEnd()} to seek to the end of the scanner.
+
+ Actual keys and values may be obtained through {@link Scanner.Entry}
+ object, which is obtained through {@link Scanner#entry()}.]]>
+
Algorithmic comparator: binary comparators that is language
+ independent. Currently, only "memcmp" is supported.
+
Language-specific comparator: binary comparators that can
+ only be constructed in specific language. For Java, the syntax
+ is "jclass:", followed by the class name of the RawComparator.
+ Currently, we only support RawComparators that can be
+ constructed through the default constructor (with no
+ parameters). Parameterized RawComparators such as
+ {@link WritableComparator} or
+ {@link JavaSerializationComparator} may not be directly used.
+ One should write a wrapper class that inherits from such classes
+ and use its default constructor to perform proper
+ initialization.
+
+ @param conf
+ The configuration object.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ If an exception is thrown, the TFile will be in an inconsistent
+ state. The only legitimate call after that would be close]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Utils#writeVLong(out, n).
+
+ @param out
+ output stream
+ @param n
+ The integer to be encoded
+ @throws IOException
+ @see Utils#writeVLong(DataOutput, long)]]>
+
+
+
+
+
+
+
+
+
if n in [-32, 127): encode in one byte with the actual value.
+ Otherwise,
+
if n in [-20*2^8, 20*2^8): encode in two bytes: byte[0] = n/256 - 52;
+ byte[1]=n&0xff. Otherwise,
+
if n IN [-16*2^16, 16*2^16): encode in three bytes: byte[0]=n/2^16 -
+ 88; byte[1]=(n>>8)&0xff; byte[2]=n&0xff. Otherwise,
+
if n in [-8*2^24, 8*2^24): encode in four bytes: byte[0]=n/2^24 - 112;
+ byte[1] = (n>>16)&0xff; byte[2] = (n>>8)&0xff; byte[3]=n&0xff. Otherwise:
+
if n in [-2^31, 2^31): encode in five bytes: byte[0]=-125; byte[1] =
+ (n>>24)&0xff; byte[2]=(n>>16)&0xff; byte[3]=(n>>8)&0xff; byte[4]=n&0xff;
+
if n in [-2^39, 2^39): encode in six bytes: byte[0]=-124; byte[1] =
+ (n>>32)&0xff; byte[2]=(n>>24)&0xff; byte[3]=(n>>16)&0xff;
+ byte[4]=(n>>8)&0xff; byte[5]=n&0xff
+
if n in [-2^47, 2^47): encode in seven bytes: byte[0]=-123; byte[1] =
+ (n>>40)&0xff; byte[2]=(n>>32)&0xff; byte[3]=(n>>24)&0xff;
+ byte[4]=(n>>16)&0xff; byte[5]=(n>>8)&0xff; byte[6]=n&0xff;
+
if n in [-2^55, 2^55): encode in eight bytes: byte[0]=-122; byte[1] =
+ (n>>48)&0xff; byte[2] = (n>>40)&0xff; byte[3]=(n>>32)&0xff;
+ byte[4]=(n>>24)&0xff; byte[5]=(n>>16)&0xff; byte[6]=(n>>8)&0xff;
+ byte[7]=n&0xff;
+
if n in [-2^63, 2^63): encode in nine bytes: byte[0]=-121; byte[1] =
+ (n>>54)&0xff; byte[2] = (n>>48)&0xff; byte[3] = (n>>40)&0xff;
+ byte[4]=(n>>32)&0xff; byte[5]=(n>>24)&0xff; byte[6]=(n>>16)&0xff;
+ byte[7]=(n>>8)&0xff; byte[8]=n&0xff;
+
+
+ @param out
+ output stream
+ @param n
+ the integer number
+ @throws IOException]]>
+
if (FB in [-72, -33]), return (FB+52)<<8 + NB[0]&0xff;
+
if (FB in [-104, -73]), return (FB+88)<<16 + (NB[0]&0xff)<<8 +
+ NB[1]&0xff;
+
if (FB in [-120, -105]), return (FB+112)<<24 + (NB[0]&0xff)<<16 +
+ (NB[1]&0xff)<<8 + NB[2]&0xff;
+
if (FB in [-128, -121]), return interpret NB[FB+129] as a signed
+ big-endian integer.
+
+ @param in
+ input stream
+ @return the decoded long integer.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Type of the input key.
+ @param list
+ The list
+ @param key
+ The input key.
+ @param cmp
+ Comparator for the key.
+ @return The index to the desired element if it exists; or list.size()
+ otherwise.]]>
+
+
+
+
+
+
+
+
+ Type of the input key.
+ @param list
+ The list
+ @param key
+ The input key.
+ @param cmp
+ Comparator for the key.
+ @return The index to the desired element if it exists; or list.size()
+ otherwise.]]>
+
+
+
+
+
+
+
+ Type of the input key.
+ @param list
+ The list
+ @param key
+ The input key.
+ @return The index to the desired element if it exists; or list.size()
+ otherwise.]]>
+
+
+
+
+
+
+
+ Type of the input key.
+ @param list
+ The list
+ @param key
+ The input key.
+ @return The index to the desired element if it exists; or list.size()
+ otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ An experimental {@link Serialization} for Java {@link Serializable} classes.
+
+ @see JavaSerializationComparator]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A {@link RawComparator} that uses a {@link JavaSerialization}
+ {@link Deserializer} to deserialize objects that are then compared via
+ their {@link Comparable} interfaces.
+
+ @param
+ @see JavaSerialization]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+This package provides a mechanism for using different serialization frameworks
+in Hadoop. The property "io.serializations" defines a list of
+{@link org.apache.hadoop.io.serializer.Serialization}s that know how to create
+{@link org.apache.hadoop.io.serializer.Serializer}s and
+{@link org.apache.hadoop.io.serializer.Deserializer}s.
+
+
+
+To add a new serialization framework write an implementation of
+{@link org.apache.hadoop.io.serializer.Serialization} and add its name to the
+"io.serializations" property.
+
+Use {@link org.apache.hadoop.io.serializer.avro.AvroSpecificSerialization} for
+serialization of classes generated by Avro's 'specific' compiler.
+
+
+
+Use {@link org.apache.hadoop.io.serializer.avro.AvroReflectSerialization} for
+other classes.
+{@link org.apache.hadoop.io.serializer.avro.AvroReflectSerialization} work for
+any class which is either in the package list configured via
+{@link org.apache.hadoop.io.serializer.avro.AvroReflectSerialization#AVRO_REFLECT_PACKAGES}
+or implement {@link org.apache.hadoop.io.serializer.avro.AvroReflectSerializable}
+interface.
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+The API is abstract so that it can be implemented on top of
+a variety of metrics client libraries. The choice of
+client library is a configuration option, and different
+modules within the same application can use
+different metrics implementation libraries.
+
+Sub-packages:
+
+
org.apache.hadoop.metrics.spi
+
The abstract Server Provider Interface package. Those wishing to
+ integrate the metrics API with a particular metrics client library should
+ extend this package.
+
+
org.apache.hadoop.metrics.file
+
An implementation package which writes the metric data to
+ a file, or sends it to the standard output stream.
+
+
org.apache.hadoop.metrics.ganglia
+
An implementation package which sends metric data to
+ Ganglia.
+
+
+
Introduction to the Metrics API
+
+Here is a simple example of how to use this package to report a single
+metric value:
+
The context name will typically identify either the application, or else a
+ module within an application or library.
+
+
myRecord
+
The record name generally identifies some entity for which a set of
+ metrics are to be reported. For example, you could have a record named
+ "cacheStats" for reporting a number of statistics relating to the usage of
+ some cache in your application.
+
+
myMetric
+
This identifies a particular metric. For example, you might have metrics
+ named "cache_hits" and "cache_misses".
+
+
+
+
Tags
+
+In some cases it is useful to have multiple records with the same name. For
+example, suppose that you want to report statistics about each disk on a computer.
+In this case, the record name would be something like "diskStats", but you also
+need to identify the disk which is done by adding a tag to the record.
+The code could look something like this:
+
+
+Data is not sent immediately to the metrics system when
+MetricsRecord.update() is called. Instead it is stored in an
+internal table, and the contents of the table are sent periodically.
+This can be important for two reasons:
+
+
It means that a programmer is free to put calls to this API in an
+ inner loop, since updates can be very frequent without slowing down
+ the application significantly.
+
Some implementations can gain efficiency by combining many metrics
+ into a single UDP message.
+
+
+The API provides a timer-based callback via the
+registerUpdater() method. The benefit of this
+versus using java.util.Timer is that the callbacks will be done
+immediately before sending the data, making the data as current as possible.
+
+
Configuration
+
+It is possible to programmatically examine and modify configuration data
+before creating a context, like this:
+
+The factory attributes can be examined and modified using the following
+ContextFactorymethods:
+
+
Object getAttribute(String attributeName)
+
String[] getAttributeNames()
+
void setAttribute(String name, Object value)
+
void removeAttribute(attributeName)
+
+
+
+ContextFactory.getFactory() initializes the factory attributes by
+reading the properties file hadoop-metrics.properties if it exists
+on the class path.
+
+
+A factory attribute named:
+
+contextName.class
+
+should have as its value the fully qualified name of the class to be
+instantiated by a call of the CodeFactory method
+getContext(contextName). If this factory attribute is not
+specified, the default is to instantiate
+org.apache.hadoop.metrics.file.FileContext.
+
+
+Other factory attributes are specific to a particular implementation of this
+API and are documented elsewhere. For example, configuration attributes for
+the file and Ganglia implementations can be found in the javadoc for
+their respective packages.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ fileName attribute,
+ if specified. Otherwise the data will be written to standard
+ output.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class is configured by setting ContextFactory attributes which in turn
+ are usually configured through a properties file. All the attributes are
+ prefixed by the contextName. For example, the properties file might contain:
+
]]>
+
+
+
+
+
+These are the implementation specific factory attributes
+(See ContextFactory.getFactory()):
+
+
+
contextName.fileName
+
The path of the file to which metrics in context contextName
+ are to be appended. If this attribute is not specified, the metrics
+ are written to standard output by default.
+
+
contextName.period
+
The period in seconds on which the metric data is written to the
+ file.
+
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Implementation of the metrics package that sends metric data to
+Ganglia.
+Programmers should not normally need to use this package directly. Instead
+they should use org.hadoop.metrics.
+
+
+These are the implementation specific factory attributes
+(See ContextFactory.getFactory()):
+
+
+
contextName.servers
+
Space and/or comma separated sequence of servers to which UDP
+ messages should be sent.
+
+
contextName.period
+
The period in seconds on which the metric data is sent to the
+ server(s).
+
+
contextName.units.recordName.metricName
+
The units for the specified metric in the specified record.
+
+
contextName.slope.recordName.metricName
+
The slope for the specified metric in the specified record.
+
+
contextName.tmax.recordName.metricName
+
The tmax for the specified metric in the specified record.
+
+
contextName.dmax.recordName.metricName
+
The dmax for the specified metric in the specified record.
+
+ Software systems of any significant complexity require mechanisms for data
+interchange with the outside world. These interchanges typically involve the
+marshaling and unmarshaling of logical units of data to and from data streams
+(files, network connections, memory buffers etc.). Applications usually have
+some code for serializing and deserializing the data types that they manipulate
+embedded in them. The work of serialization has several features that make
+automatic code generation for it worthwhile. Given a particular output encoding
+(binary, XML, etc.), serialization of primitive types and simple compositions
+of primitives (structs, vectors etc.) is a very mechanical task. Manually
+written serialization code can be susceptible to bugs especially when records
+have a large number of fields or a record definition changes between software
+versions. Lastly, it can be very useful for applications written in different
+programming languages to be able to share and interchange data. This can be
+made a lot easier by describing the data records manipulated by these
+applications in a language agnostic manner and using the descriptions to derive
+implementations of serialization in multiple target languages.
+
+This document describes Hadoop Record I/O, a mechanism that is aimed
+at
+
+
enabling the specification of simple serializable data types (records)
+
enabling the generation of code in multiple target languages for
+marshaling and unmarshaling such types
+
providing target language specific support that will enable application
+programmers to incorporate generated code into their applications
+
+
+The goals of Hadoop Record I/O are similar to those of mechanisms such as XDR,
+ASN.1, PADS and ICE. While these systems all include a DDL that enables
+the specification of most record types, they differ widely in what else they
+focus on. The focus in Hadoop Record I/O is on data marshaling and
+multi-lingual support. We take a translator-based approach to serialization.
+Hadoop users have to describe their data in a simple data description
+language. The Hadoop DDL translator rcc generates code that users
+can invoke in order to read/write their data from/to simple stream
+abstractions. Next we list explicitly some of the goals and non-goals of
+Hadoop Record I/O.
+
+
+
Goals
+
+
+
Support for commonly used primitive types. Hadoop should include as
+primitives commonly used builtin types from programming languages we intend to
+support.
+
+
Support for common data compositions (including recursive compositions).
+Hadoop should support widely used composite types such as structs and
+vectors.
+
+
Code generation in multiple target languages. Hadoop should be capable of
+generating serialization code in multiple target languages and should be
+easily extensible to new target languages. The initial target languages are
+C++ and Java.
+
+
Support for generated target languages. Hadooop should include support
+in the form of headers, libraries, packages for supported target languages
+that enable easy inclusion and use of generated code in applications.
+
+
Support for multiple output encodings. Candidates include
+packed binary, comma-separated text, XML etc.
+
+
Support for specifying record types in a backwards/forwards compatible
+manner. This will probably be in the form of support for optional fields in
+records. This version of the document does not include a description of the
+planned mechanism, we intend to include it in the next iteration.
+
+
+
+
Non-Goals
+
+
+
Serializing existing arbitrary C++ classes.
+
Serializing complex data structures such as trees, linked lists etc.
+
Built-in indexing schemes, compression, or check-sums.
+
Dynamic construction of objects from an XML schema.
+
+
+The remainder of this document describes the features of Hadoop record I/O
+in more detail. Section 2 describes the data types supported by the system.
+Section 3 lays out the DDL syntax with some examples of simple records.
+Section 4 describes the process of code generation with rcc. Section 5
+describes target language mappings and support for Hadoop types. We include a
+fairly complete description of C++ mappings with intent to include Java and
+others in upcoming iterations of this document. The last section talks about
+supported output encodings.
+
+
+
Data Types and Streams
+
+This section describes the primitive and composite types supported by Hadoop.
+We aim to support a set of types that can be used to simply and efficiently
+express a wide range of record types in different programming languages.
+
+
Primitive Types
+
+For the most part, the primitive types of Hadoop map directly to primitive
+types in high level programming languages. Special cases are the
+ustring (a Unicode string) and buffer types, which we believe
+find wide use and which are usually implemented in library code and not
+available as language built-ins. Hadoop also supplies these via library code
+when a target language built-in is not present and there is no widely
+adopted "standard" implementation. The complete list of primitive types is:
+
+
+
byte: An 8-bit unsigned integer.
+
boolean: A boolean value.
+
int: A 32-bit signed integer.
+
long: A 64-bit signed integer.
+
float: A single precision floating point number as described by
+ IEEE-754.
+
double: A double precision floating point number as described by
+ IEEE-754.
+
ustring: A string consisting of Unicode characters.
+
buffer: An arbitrary sequence of bytes.
+
+
+
+
Composite Types
+Hadoop supports a small set of composite types that enable the description
+of simple aggregate types and containers. A composite type is serialized
+by sequentially serializing it constituent elements. The supported
+composite types are:
+
+
+
+
record: An aggregate type like a C-struct. This is a list of
+typed fields that are together considered a single unit of data. A record
+is serialized by sequentially serializing its constituent fields. In addition
+to serialization a record has comparison operations (equality and less-than)
+implemented for it, these are defined as memberwise comparisons.
+
+
vector: A sequence of entries of the same data type, primitive
+or composite.
+
+
map: An associative container mapping instances of a key type to
+instances of a value type. The key and value types may themselves be primitive
+or composite types.
+
+
+
+
Streams
+
+Hadoop generates code for serializing and deserializing record types to
+abstract streams. For each target language Hadoop defines very simple input
+and output stream interfaces. Application writers can usually develop
+concrete implementations of these by putting a one method wrapper around
+an existing stream implementation.
+
+
+
DDL Syntax and Examples
+
+We now describe the syntax of the Hadoop data description language. This is
+followed by a few examples of DDL usage.
+
+
+
+A DDL file describes one or more record types. It begins with zero or
+more include declarations, a single mandatory module declaration
+followed by zero or more class declarations. The semantics of each of
+these declarations are described below:
+
+
+
+
include: An include declaration specifies a DDL file to be
+referenced when generating code for types in the current DDL file. Record types
+in the current compilation unit may refer to types in all included files.
+File inclusion is recursive. An include does not trigger code
+generation for the referenced file.
+
+
module: Every Hadoop DDL file must have a single module
+declaration that follows the list of includes and precedes all record
+declarations. A module declaration identifies a scope within which
+the names of all types in the current file are visible. Module names are
+mapped to C++ namespaces, Java packages etc. in generated code.
+
+
class: Records types are specified through class
+declarations. A class declaration is like a Java class declaration.
+It specifies a named record type and a list of fields that constitute records
+of the type. Usage is illustrated in the following examples.
+
+
+
+
Examples
+
+
+
A simple DDL file links.jr with just one record declaration.
+
+module links {
+ class Link {
+ ustring URL;
+ boolean isRelative;
+ ustring anchorText;
+ };
+}
+
+
+The Hadoop translator is written in Java. Invocation is done by executing a
+wrapper shell script named named rcc. It takes a list of
+record description files as a mandatory argument and an
+optional language argument (the default is Java) --language or
+-l. Thus a typical invocation would look like:
+
+$ rcc -l C++ ...
+
+
+
+
Target Language Mappings and Support
+
+For all target languages, the unit of code generation is a record type.
+For each record type, Hadoop generates code for serialization and
+deserialization, record comparison and access to record members.
+
+
C++
+
+Support for including Hadoop generated C++ code in applications comes in the
+form of a header file recordio.hh which needs to be included in source
+that uses Hadoop types and a library librecordio.a which applications need
+to be linked with. The header declares the Hadoop C++ namespace which defines
+appropriate types for the various primitives, the basic interfaces for
+records and streams and enumerates the supported serialization encodings.
+Declarations of these interfaces and a description of their semantics follow:
+
+
RecFormat: An enumeration of the serialization encodings supported
+by this implementation of Hadoop.
+
+
InStream: A simple abstraction for an input stream. This has a
+single public read method that reads n bytes from the stream into
+the buffer buf. Has the same semantics as a blocking read system
+call. Returns the number of bytes read or -1 if an error occurs.
+
+
OutStream: A simple abstraction for an output stream. This has a
+single write method that writes n bytes to the stream from the
+buffer buf. Has the same semantics as a blocking write system
+call. Returns the number of bytes written or -1 if an error occurs.
+
+
RecordReader: A RecordReader reads records one at a time from
+an underlying stream in a specified record format. The reader is instantiated
+with a stream and a serialization format. It has a read method that
+takes an instance of a record and deserializes the record from the stream.
+
+
RecordWriter: A RecordWriter writes records one at a
+time to an underlying stream in a specified record format. The writer is
+instantiated with a stream and a serialization format. It has a
+write method that takes an instance of a record and serializes the
+record to the stream.
+
+
Record: The base class for all generated record types. This has two
+public methods type and signature that return the typename and the
+type signature of the record.
+
+
+
+Two files are generated for each record file (note: not for each record). If a
+record file is named "name.jr", the generated files are
+"name.jr.cc" and "name.jr.hh" containing serialization
+implementations and record type declarations respectively.
+
+For each record in the DDL file, the generated header file will contain a
+class definition corresponding to the record type, method definitions for the
+generated type will be present in the '.cc' file. The generated class will
+inherit from the abstract class hadoop::Record. The DDL files
+module declaration determines the namespace the record belongs to.
+Each '.' delimited token in the module declaration results in the
+creation of a namespace. For instance, the declaration module docs.links
+results in the creation of a docs namespace and a nested
+docs::links namespace. In the preceding examples, the Link class
+is placed in the links namespace. The header file corresponding to
+the links.jr file will contain:
+
+
+namespace links {
+ class Link : public hadoop::Record {
+ // ....
+ };
+};
+
+
+Each field within the record will cause the generation of a private member
+declaration of the appropriate type in the class declaration, and one or more
+acccessor methods. The generated class will implement the serialize and
+deserialize methods defined in hadoop::Record+. It will also
+implement the inspection methods type and signature from
+hadoop::Record. A default constructor and virtual destructor will also
+be generated. Serialization code will read/write records into streams that
+implement the hadoop::InStream and the hadoop::OutStream interfaces.
+
+For each member of a record an accessor method is generated that returns
+either the member or a reference to the member. For members that are returned
+by value, a setter method is also generated. This is true for primitive
+data members of the types byte, int, long, boolean, float and
+double. For example, for a int field called MyField the folowing
+code is generated.
+
+
+
+For a ustring or buffer or composite field. The generated code
+only contains accessors that return a reference to the field. A const
+and a non-const accessor are generated. For example:
+
+
+
+Code generation for Java is similar to that for C++. A Java class is generated
+for each record type with private members corresponding to the fields. Getters
+and setters for fields are also generated. Some differences arise in the
+way comparison is expressed and in the mapping of modules to packages and
+classes to files. For equality testing, an equals method is generated
+for each record type. As per Java requirements a hashCode method is also
+generated. For comparison a compareTo method is generated for each
+record type. This has the semantics as defined by the Java Comparable
+interface, that is, the method returns a negative integer, zero, or a positive
+integer as the invoked object is less than, equal to, or greater than the
+comparison parameter.
+
+A .java file is generated per record type as opposed to per DDL
+file as in C++. The module declaration translates to a Java
+package declaration. The module name maps to an identical Java package
+name. In addition to this mapping, the DDL compiler creates the appropriate
+directory hierarchy for the package and places the generated .java
+files in the correct directories.
+
+
Mapping Summary
+
+
+DDL Type C++ Type Java Type
+
+boolean bool boolean
+byte int8_t byte
+int int32_t int
+long int64_t long
+float float float
+double double double
+ustring std::string java.lang.String
+buffer std::string org.apache.hadoop.record.Buffer
+class type class type class type
+vector std::vector java.util.ArrayList
+map std::map java.util.TreeMap
+
+
+
Data encodings
+
+This section describes the format of the data encodings supported by Hadoop.
+Currently, three data encodings are supported, namely binary, CSV and XML.
+
+
Binary Serialization Format
+
+The binary data encoding format is fairly dense. Serialization of composite
+types is simply defined as a concatenation of serializations of the constituent
+elements (lengths are included in vectors and maps).
+
+Composite types are serialized as follows:
+
+
class: Sequence of serialized members.
+
vector: The number of elements serialized as an int. Followed by a
+sequence of serialized elements.
+
map: The number of key value pairs serialized as an int. Followed
+by a sequence of serialized (key,value) pairs.
+
+
+Serialization of primitives is more interesting, with a zero compression
+optimization for integral types and normalization to UTF-8 for strings.
+Primitive types are serialized as follows:
+
+
+
byte: Represented by 1 byte, as is.
+
boolean: Represented by 1-byte (0 or 1)
+
int/long: Integers and longs are serialized zero compressed.
+Represented as 1-byte if -120 <= value < 128. Otherwise, serialized as a
+sequence of 2-5 bytes for ints, 2-9 bytes for longs. The first byte represents
+the number of trailing bytes, N, as the negative number (-120-N). For example,
+the number 1024 (0x400) is represented by the byte sequence 'x86 x04 x00'.
+This doesn't help much for 4-byte integers but does a reasonably good job with
+longs without bit twiddling.
+
float/double: Serialized in IEEE 754 single and double precision
+format in network byte order. This is the format used by Java.
+
ustring: Serialized as 4-byte zero compressed length followed by
+data encoded as UTF-8. Strings are normalized to UTF-8 regardless of native
+language representation.
+
buffer: Serialized as a 4-byte zero compressed length followed by the
+raw bytes in the buffer.
+
+
+
+
CSV Serialization Format
+
+The CSV serialization format has a lot more structure than the "standard"
+Excel CSV format, but we believe the additional structure is useful because
+
+
+
it makes parsing a lot easier without detracting too much from legibility
+
the delimiters around composites make it obvious when one is reading a
+sequence of Hadoop records
+
+
+Serialization formats for the various types are detailed in the grammar that
+follows. The notable feature of the formats is the use of delimiters for
+indicating the certain field types.
+
+
+
A string field begins with a single quote (').
+
A buffer field begins with a sharp (#).
+
A class, vector or map begins with 's{', 'v{' or 'm{' respectively and
+ends with '}'.
+
+
+The CSV format can be described by the following grammar:
+
+
+
+The XML serialization format is the same used by Apache XML-RPC
+(http://ws.apache.org/xmlrpc/types.html). This is an extension of the original
+XML-RPC format and adds some additional data types. All record I/O types are
+not directly expressible in this format, and access to a DDL is required in
+order to convert these to valid types. All types primitive or composite are
+represented by <value> elements. The particular XML-RPC type is
+indicated by a nested element in the <value> element. The encoding for
+records is always UTF-8. Primitive types are serialized as follows:
+
+
+
byte: XML tag <ex:i1>. Values: 1-byte unsigned
+integers represented in US-ASCII
+
boolean: XML tag <boolean>. Values: "0" or "1"
+
int: XML tags <i4> or <int>. Values: 4-byte
+signed integers represented in US-ASCII.
+
long: XML tag <ex:i8>. Values: 8-byte signed integers
+represented in US-ASCII.
+
float: XML tag <ex:float>. Values: Single precision
+floating point numbers represented in US-ASCII.
+
double: XML tag <double>. Values: Double precision
+floating point numbers represented in US-ASCII.
+
ustring: XML tag <;string>. Values: String values
+represented as UTF-8. XML does not permit all Unicode characters in literal
+data. In particular, NULLs and control chars are not allowed. Additionally,
+XML processors are required to replace carriage returns with line feeds and to
+replace CRLF sequences with line feeds. Programming languages that we work
+with do not impose these restrictions on string types. To work around these
+restrictions, disallowed characters and CRs are percent escaped in strings.
+The '%' character is also percent escaped.
+
buffer: XML tag <string&>. Values: Arbitrary binary
+data. Represented as hexBinary, each byte is replaced by its 2-byte
+hexadecimal representation.
+
+
+Composite types are serialized as follows:
+
+
+
class: XML tag <struct>. A struct is a sequence of
+<member> elements. Each <member> element has a <name>
+element and a <value> element. The <name> is a string that must
+match /[a-zA-Z][a-zA-Z0-9_]*/. The value of the member is represented
+by a <value> element.
+
+
vector: XML tag <array<. An <array> contains a
+single <data> element. The <data> element is a sequence of
+<value> elements each of which represents an element of the vector.
+
+
map: XML tag <array>. Same as vector.
+
+
+
+For example:
+
+
+class {
+ int MY_INT; // value 5
+ vector MY_VEC; // values 0.1, -0.89, 2.45e4
+ buffer MY_BUF; // value '\00\n\tabc%'
+}
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This task takes the given record definition files and compiles them into
+ java or c++
+ files. It is then up to the user to compile the generated files.
+
+
The task requires the file or the nested fileset element to be
+ specified. Optional attributes are language (set the output
+ language, default is "java"),
+ destdir (name of the destination directory for generated java/c++
+ code, default is ".") and failonerror (specifies error handling
+ behavior. default is true).
+
+ public class MyApp extends Configured implements Tool {
+
+ public int run(String[] args) throws Exception {
+ // Configuration processed by ToolRunner
+ Configuration conf = getConf();
+
+ // Create a JobConf using the processed conf
+ JobConf job = new JobConf(conf, MyApp.class);
+
+ // Process custom command-line options
+ Path in = new Path(args[1]);
+ Path out = new Path(args[2]);
+
+ // Specify various job-specific parameters
+ job.setJobName("my-app");
+ job.setInputPath(in);
+ job.setOutputPath(out);
+ job.setMapperClass(MyMapper.class);
+ job.setReducerClass(MyReducer.class);
+
+ // Submit the job, then poll for progress until the job is complete
+ JobClient.runJob(job);
+ return 0;
+ }
+
+ public static void main(String[] args) throws Exception {
+ // Let ToolRunner handle generic command-line options
+ int res = ToolRunner.run(new Configuration(), new MyApp(), args);
+
+ System.exit(res);
+ }
+ }
+
+
+ @see GenericOptionsParser
+ @see ToolRunner]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Tool by {@link Tool#run(String[])}, after
+ parsing with the given generic arguments. Uses the given
+ Configuration, or builds one if null.
+
+ Sets the Tool's configuration with the possibly modified
+ version of the conf.
+
+ @param conf Configuration for the Tool.
+ @param tool Tool to run.
+ @param args command-line arguments to the tool.
+ @return exit code of the {@link Tool#run(String[])} method.]]>
+
+
+
+
+
+
+
+ Tool with its Configuration.
+
+ Equivalent to run(tool.getConf(), tool, args).
+
+ @param tool Tool to run.
+ @param args command-line arguments to the tool.
+ @return exit code of the {@link Tool#run(String[])} method.]]>
+
+
+
+
+
+
+
+
+
+ ToolRunner can be used to run classes implementing
+ Tool interface. It works in conjunction with
+ {@link GenericOptionsParser} to parse the
+
+ generic hadoop command line arguments and modifies the
+ Configuration of the Tool. The
+ application-specific options are passed along without being modified.
+
+
+ @see Tool
+ @see GenericOptionsParser]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ this filter.
+ @param nbHash The number of hash function to consider.
+ @param hashType type of the hashing function (see
+ {@link org.apache.hadoop.util.hash.Hash}).]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Bloom filter, as defined by Bloom in 1970.
+
+ The Bloom filter is a data structure that was introduced in 1970 and that has been adopted by
+ the networking research community in the past decade thanks to the bandwidth efficiencies that it
+ offers for the transmission of set membership information between networked hosts. A sender encodes
+ the information into a bit vector, the Bloom filter, that is more compact than a conventional
+ representation. Computation and space costs for construction are linear in the number of elements.
+ The receiver uses the filter to test whether various elements are members of the set. Though the
+ filter will occasionally return a false positive, it will never return a false negative. When creating
+ the filter, the sender can choose its desired point in a trade-off between the false positive rate and the size.
+
+
+
+
+
+
+
+
+
+
+
+
+
+ this filter.
+ @param nbHash The number of hash function to consider.
+ @param hashType type of the hashing function (see
+ {@link org.apache.hadoop.util.hash.Hash}).]]>
+
+
+
+
+
+
+
+
+ this counting Bloom filter.
+
+ Invariant: nothing happens if the specified key does not belong to this counter Bloom filter.
+ @param key The key to remove.]]>
+
+
+
+
+
+
+
+
+
+
+
+ key -> count map.
+
NOTE: due to the bucket size of this filter, inserting the same
+ key more than 15 times will cause an overflow at all filter positions
+ associated with this key, and it will significantly increase the error
+ rate for this and other keys. For this reason the filter can only be
+ used to store small count values 0 <= N << 15.
+ @param key key to be tested
+ @return 0 if the key is not present. Otherwise, a positive value v will
+ be returned such that v == count with probability equal to the
+ error rate of this filter, and v > count otherwise.
+ Additionally, if the filter experienced an underflow as a result of
+ {@link #delete(Key)} operation, the return value may be lower than the
+ count with the probability of the false negative rate of such
+ filter.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ counting Bloom filter, as defined by Fan et al. in a ToN
+ 2000 paper.
+
+ A counting Bloom filter is an improvement to standard a Bloom filter as it
+ allows dynamic additions and deletions of set membership information. This
+ is achieved through the use of a counting vector instead of a bit vector.
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Builds an empty Dynamic Bloom filter.
+ @param vectorSize The number of bits in the vector.
+ @param nbHash The number of hash function to consider.
+ @param hashType type of the hashing function (see
+ {@link org.apache.hadoop.util.hash.Hash}).
+ @param nr The threshold for the maximum number of keys to record in a
+ dynamic Bloom filter row.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ dynamic Bloom filter, as defined in the INFOCOM 2006 paper.
+
+ A dynamic Bloom filter (DBF) makes use of a s * m bit matrix but
+ each of the s rows is a standard Bloom filter. The creation
+ process of a DBF is iterative. At the start, the DBF is a 1 * m
+ bit matrix, i.e., it is composed of a single standard Bloom filter.
+ It assumes that nr elements are recorded in the
+ initial bit vector, where nr <= n (n is
+ the cardinality of the set A to record in the filter).
+
+ As the size of A grows during the execution of the application,
+ several keys must be inserted in the DBF. When inserting a key into the DBF,
+ one must first get an active Bloom filter in the matrix. A Bloom filter is
+ active when the number of recorded keys, nr, is
+ strictly less than the current cardinality of A, n.
+ If an active Bloom filter is found, the key is inserted and
+ nr is incremented by one. On the other hand, if there
+ is no active Bloom filter, a new one is created (i.e., a new row is added to
+ the matrix) according to the current size of A and the element
+ is added in this new Bloom filter and the nr value of
+ this new Bloom filter is set to one. A given key is said to belong to the
+ DBF if the k positions are set to one in one of the matrix rows.
+
+
+
+
+
+
+
+
+ Builds a hash function that must obey to a given maximum number of returned values and a highest value.
+ @param maxValue The maximum highest returned value.
+ @param nbHash The number of resulting hashed values.
+ @param hashType type of the hashing function (see {@link Hash}).]]>
+
+
+
+
+ this hash function. A NOOP]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The idea is to randomly select a bit to reset.]]>
+
+
+
+
+
+ The idea is to select the bit to reset that will generate the minimum
+ number of false negative.]]>
+
+
+
+
+
+ The idea is to select the bit to reset that will remove the maximum number
+ of false positive.]]>
+
+
+
+
+
+ The idea is to select the bit to reset that will, at the same time, remove
+ the maximum number of false positve while minimizing the amount of false
+ negative generated.]]>
+
+
+
+
+ Originally created by
+ European Commission One-Lab Project 034819.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ this filter.
+ @param nbHash The number of hash function to consider.
+ @param hashType type of the hashing function (see
+ {@link org.apache.hadoop.util.hash.Hash}).]]>
+
+
+
+
+
+
+
+
+ this retouched Bloom filter.
+
+ Invariant: if the false positive is null, nothing happens.
+ @param key The false positive key to add.]]>
+
+
+
+
+
+ this retouched Bloom filter.
+ @param coll The collection of false positive.]]>
+
+
+
+
+
+ this retouched Bloom filter.
+ @param keys The list of false positive.]]>
+
+
+
+
+
+ this retouched Bloom filter.
+ @param keys The array of false positive.]]>
+
+
+
+
+
+
+ this retouched Bloom filter.
+ @param scheme The selective clearing scheme to apply.]]>
+
+
+
+
+
+
+
+
+
+
+
+ retouched Bloom filter, as defined in the CoNEXT 2006 paper.
+
+ It allows the removal of selected false positives at the cost of introducing
+ random false negatives, and with the benefit of eliminating some random false
+ positives at the same time.
+
+
+
+
+
+
+
+
+
diff --git a/x86_64/share/hadoop/common/jdiff/hadoop-core_0.22.0.xml b/x86_64/share/hadoop/common/jdiff/hadoop-core_0.22.0.xml
new file mode 100644
index 0000000..b3130dc
--- /dev/null
+++ b/x86_64/share/hadoop/common/jdiff/hadoop-core_0.22.0.xml
@@ -0,0 +1,28377 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ UnsupportedOperationException
+ @param key
+ @param newKeys
+ @param customMessage]]>
+
+
+
+
+
+
+ UnsupportedOperationException
+
+ @param key Key that is to be deprecated
+ @param newKeys list of keys that take up the values of deprecated key]]>
+
+
+
+
+
+
+
+
+
+
+
+ final.
+
+ @param name resource to be added, the classpath is examined for a file
+ with that name.]]>
+
+
+
+
+
+ final.
+
+ @param url url of the resource to be added, the local filesystem is
+ examined directly to find the resource, without referring to
+ the classpath.]]>
+
+
+
+
+
+ final.
+
+ @param file file-path of resource to be added, the local filesystem is
+ examined directly to find the resource, without referring to
+ the classpath.]]>
+
+
+
+
+
+ final.
+
+ @param in InputStream to deserialize the object from.]]>
+
+
+
+
+
+
+
+
+
+
+ name property, null if
+ no such property exists. If the key is deprecated, it returns the value of
+ the first key which replaces the deprecated key and is not null
+
+ Values are processed for variable expansion
+ before being returned.
+
+ @param name the property name.
+ @return the value of the name or its replacing property,
+ or null if no such property exists.]]>
+
+
+
+
+
+ name property as a trimmed String,
+ null if no such property exists.
+ If the key is deprecated, it returns the value of
+ the first key which replaces the deprecated key and is not null
+
+ Values are processed for variable expansion
+ before being returned.
+
+ @param name the property name.
+ @return the value of the name or its replacing property,
+ or null if no such property exists.]]>
+
+
+
+
+
+ name property, without doing
+ variable expansion.If the key is
+ deprecated, it returns the value of the first key which replaces
+ the deprecated key and is not null.
+
+ @param name the property name.
+ @return the value of the name property or
+ its replacing property and null if no such property exists.]]>
+
+
+
+
+
+
+ value of the name property. If
+ name is deprecated, it sets the value to the keys
+ that replace the deprecated key.
+
+ @param name property name.
+ @param value property value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ name. If the key is deprecated,
+ it returns the value of the first key which replaces the deprecated key
+ and is not null.
+ If no such property exists,
+ then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value, or defaultValue if the property
+ doesn't exist.]]>
+
+
+
+
+
+
+ name property as an int.
+
+ If no such property exists, or if the specified value is not a valid
+ int, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as an int,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to an int.
+
+ @param name property name.
+ @param value int value of the property.]]>
+
+
+
+
+
+
+ name property as a long.
+ If no such property is specified, or if the specified value is not a valid
+ long, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as a long,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to a long.
+
+ @param name property name.
+ @param value long value of the property.]]>
+
+
+
+
+
+
+ name property as a float.
+ If no such property is specified, or if the specified value is not a valid
+ float, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as a float,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to a float.
+
+ @param name property name.
+ @param value property value.]]>
+
+
+
+
+
+
+ name property as a boolean.
+ If no such property is specified, or if the specified value is not a valid
+ boolean, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as a boolean,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to a boolean.
+
+ @param name property name.
+ @param value boolean value of the property.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ name property to the given type. This
+ is equivalent to set(<name>, value.toString()).
+ @param name property name
+ @param value new value]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ name property as a Pattern.
+ If no such property is specified, or if the specified value is not a valid
+ Pattern, then DefaultValue is returned.
+
+ @param name property name
+ @param defaultValue default value
+ @return property value as a compiled Pattern, or defaultValue]]>
+
+
+
+
+
+
+ Pattern.
+ If the pattern is passed as null, sets the empty pattern which results in
+ further calls to getPattern(...) returning the default value.
+
+ @param name property name
+ @param pattern new value]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ name property as
+ a collection of Strings.
+ If no such property is specified then empty collection is returned.
+
+ This is an optimized version of {@link #getStrings(String)}
+
+ @param name property name.
+ @return property value as a collection of Strings.]]>
+
+
+
+
+
+ name property as
+ an array of Strings.
+ If no such property is specified then null is returned.
+
+ @param name property name.
+ @return property value as an array of Strings,
+ or null.]]>
+
+
+
+
+
+
+ name property as
+ an array of Strings.
+ If no such property is specified then default value is returned.
+
+ @param name property name.
+ @param defaultValue The default value
+ @return property value as an array of Strings,
+ or default value.]]>
+
+
+
+
+
+ name property as
+ a collection of Strings, trimmed of the leading and trailing whitespace.
+ If no such property is specified then empty Collection is returned.
+
+ @param name property name.
+ @return property value as a collection of Strings, or empty Collection]]>
+
+
+
+
+
+ name property as
+ an array of Strings, trimmed of the leading and trailing whitespace.
+ If no such property is specified then an empty array is returned.
+
+ @param name property name.
+ @return property value as an array of trimmed Strings,
+ or empty array.]]>
+
+
+
+
+
+
+ name property as
+ an array of Strings, trimmed of the leading and trailing whitespace.
+ If no such property is specified then default value is returned.
+
+ @param name property name.
+ @param defaultValue The default value
+ @return property value as an array of trimmed Strings,
+ or default value.]]>
+
+
+
+
+
+
+ name property as
+ as comma delimited values.
+
+ @param name property name.
+ @param values The values]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ name property
+ as an array of Class.
+ The value of the property specifies a list of comma separated class names.
+ If no such property is specified, then defaultValue is
+ returned.
+
+ @param name the property name.
+ @param defaultValue default value.
+ @return property value as a Class[],
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property as a Class.
+ If no such property is specified, then defaultValue is
+ returned.
+
+ @param name the class name.
+ @param defaultValue default value.
+ @return property value as a Class,
+ or defaultValue.]]>
+
+
+
+
+
+
+
+ name property as a Class
+ implementing the interface specified by xface.
+
+ If no such property is specified, then defaultValue is
+ returned.
+
+ An exception is thrown if the returned class does not implement the named
+ interface.
+
+ @param name the class name.
+ @param defaultValue default value.
+ @param xface the interface implemented by the named class.
+ @return property value as a Class,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property as a List
+ of objects implementing the interface specified by xface.
+
+ An exception is thrown if any of the classes does not exist, or if it does
+ not implement the named interface.
+
+ @param name the property name.
+ @param xface the interface implemented by the classes named by
+ name.
+ @return a List of objects implementing xface.]]>
+
+
+
+
+
+
+
+ name property to the name of a
+ theClass implementing the given interface xface.
+
+ An exception is thrown if theClass does not implement the
+ interface xface.
+
+ @param name property name.
+ @param theClass property value.
+ @param xface the interface implemented by the named class.]]>
+
+
+
+
+
+
+
+ dirsProp with
+ the given path. If dirsProp contains multiple directories,
+ then one is chosen based on path's hash code. If the selected
+ directory does not exist, an attempt is made to create it.
+
+ @param dirsProp directory in which to locate the file.
+ @param path file-path.
+ @return local file under the directory with the given path.]]>
+
+
+
+
+
+
+
+ dirsProp with
+ the given path. If dirsProp contains multiple directories,
+ then one is chosen based on path's hash code. If the selected
+ directory does not exist, an attempt is made to create it.
+
+ @param dirsProp directory in which to locate the file.
+ @param path file-path.
+ @return local file under the directory with the given path.]]>
+
+
+
+
+
+
+
+
+
+
+
+ name.
+
+ @param name configuration resource name.
+ @return an input stream attached to the resource.]]>
+
+
+
+
+
+ name.
+
+ @param name configuration resource name.
+ @return a reader attached to the resource.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ String
+ key-value pairs in the configuration.
+
+ @return an iterator over the entries.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true to set quiet-mode on, false
+ to turn it off.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ with matching keys]]>
+
+
+
+ Resources
+
+
Configurations are specified by resources. A resource contains a set of
+ name/value pairs as XML data. Each resource is named by either a
+ String or by a {@link Path}. If named by a String,
+ then the classpath is examined for a file with that name. If named by a
+ Path, then the local filesystem is examined directly, without
+ referring to the classpath.
+
+
Unless explicitly turned off, Hadoop by default specifies two
+ resources, loaded in-order from the classpath:
core-site.xml: Site-specific configuration for a given hadoop
+ installation.
+
+ Applications may add additional resources, which are loaded
+ subsequent to these resources in the order they are added.
+
+
Final Parameters
+
+
Configuration parameters may be declared final.
+ Once a resource declares a value final, no subsequently-loaded
+ resource can alter that value.
+ For example, one might define a final parameter with:
+
+
+ When conf.get("tempdir") is called, then ${basedir}
+ will be resolved to another property in this Configuration, while
+ ${user.name} would then ordinarily be resolved to the value
+ of the System property with that name.]]>
+
+ Combine {@link #OVERWRITE} with either {@link #CREATE}
+ or {@link #APPEND} does the same as only use
+ {@link #OVERWRITE}.
+ Combine {@link #CREATE} with {@link #APPEND} has the semantic:
+
Progress - to report progress on the operation - default null
+
Permission - umask is applied against permisssion: default is
+ FsPermissions:getDefault()
+
+
CreateParent - create missing parent path; default is to not
+ to create parents
+
The defaults for the following are SS defaults of the file
+ server implementing the target path. Not all parameters make sense
+ for all kinds of file system - eg. localFS ignores Blocksize,
+ replication, checksum
+
+
BufferSize - buffersize used in FSDataOutputStream
+
Blocksize - block size for file blocks
+
ReplicationFactor - replication for blocks
+
BytesPerChecksum - bytes per checksum
+
+
+
+ @return {@link FSDataOutputStream} for created file
+
+ @throws AccessControlException If access is denied
+ @throws FileAlreadyExistsException If file f already exists
+ @throws FileNotFoundException If parent of f does not exist
+ and createParent is false
+ @throws ParentNotDirectoryException If parent of f is not a
+ directory.
+ @throws UnsupportedFileSystemException If file system for f is
+ not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server
+
+ RuntimeExceptions:
+ @throws InvalidPathException If path f is not valid]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ dir already
+ exists
+ @throws FileNotFoundException If parent of dir does not exist
+ and createParent is false
+ @throws ParentNotDirectoryException If parent of dir is not a
+ directory
+ @throws UnsupportedFileSystemException If file system for dir
+ is not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server
+
+ RuntimeExceptions:
+ @throws InvalidPathException If path dir is not valid]]>
+
+
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f is
+ not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server
+
+ RuntimeExceptions:
+ @throws InvalidPathException If path f is invalid]]>
+
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f
+ is not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f is
+ not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Fails if src is a file and dst is a directory.
+
Fails if src is a directory and dst is a file.
+
Fails if the parent of dst does not exist or is a file.
+
+
+ If OVERWRITE option is not passed as an argument, rename fails if the dst
+ already exists.
+
+ If OVERWRITE option is passed as an argument, rename overwrites the dst if
+ it is a file or an empty directory. Rename fails if dst is a non-empty
+ directory.
+
+ Note that atomicity of rename is dependent on the file system
+ implementation. Please refer to the file system documentation for details
+
+
+ @param src path to be renamed
+ @param dst new path after rename
+
+ @throws AccessControlException If access is denied
+ @throws FileAlreadyExistsException If dst already exists and
+ options has {@link Rename#OVERWRITE} option
+ false.
+ @throws FileNotFoundException If src does not exist
+ @throws ParentNotDirectoryException If parent of dst is not a
+ directory
+ @throws UnsupportedFileSystemException If file system for src
+ and dst is not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f
+ is not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f is
+ not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server
+
+ RuntimeExceptions:
+ @throws HadoopIllegalArgumentException If username or
+ groupname is invalid.]]>
+
+
+
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f is
+ not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f is
+ not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f is
+ not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f is
+ not supported
+ @throws IOException If an I/O error occurred]]>
+
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f is
+ not supported
+ @throws IOException If the given path does not refer to a symlink
+ or an I/O error occurred]]>
+
+
+
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f is
+ not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server
+
+ RuntimeExceptions:
+ @throws InvalidPathException If path f is invalid]]>
+
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f is
+ not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Given a path referring to a symlink of form:
+
+ <---X--->
+ fs://host/A/B/link
+ <-----Y----->
+
+ In this path X is the scheme and authority that identify the file system,
+ and Y is the path leading up to the final path component "link". If Y is
+ a symlink itself then let Y' be the target of Y and X' be the scheme and
+ authority of Y'. Symlink targets may:
+
+ 1. Fully qualified URIs
+
+ fs://hostX/A/B/file Resolved according to the target file system.
+
+ 2. Partially qualified URIs (eg scheme but no host)
+
+ fs:///A/B/file Resolved according to the target file sytem. Eg resolving
+ a symlink to hdfs:///A results in an exception because
+ HDFS URIs must be fully qualified, while a symlink to
+ file:///A will not since Hadoop's local file systems
+ require partially qualified URIs.
+
+ 3. Relative paths
+
+ path Resolves to [Y'][path]. Eg if Y resolves to hdfs://host/A and path
+ is "../B/file" then [Y'][path] is hdfs://host/B/file
+
+ 4. Absolute paths
+
+ path Resolves to [X'][path]. Eg if Y resolves hdfs://host/A/B and path
+ is "/file" then [X][path] is hdfs://host/file
+
+
+ @param target the target of the symbolic link
+ @param link the path to be created that points to target
+ @param createParent if true then missing parent dirs are created if
+ false then parent must exist
+
+
+ @throws AccessControlException If access is denied
+ @throws FileAlreadyExistsException If file linkcode> already exists
+ @throws FileNotFoundException If target does not exist
+ @throws ParentNotDirectoryException If parent of link is not a
+ directory.
+ @throws UnsupportedFileSystemException If file system for
+ target or link is not supported
+ @throws IOException If an I/O error occurred]]>
+
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f is
+ not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f is
+ not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+ f is
+ not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ *** Path Names ***
+
+
+ The Hadoop file system supports a URI name space and URI names.
+ It offers a forest of file systems that can be referenced using fully
+ qualified URIs.
+ Two common Hadoop file systems implementations are
+
+
the local file system: file:///path
+
the hdfs file system hdfs://nnAddress:nnPort/path
+
+
+ While URI names are very flexible, it requires knowing the name or address
+ of the server. For convenience one often wants to access the default system
+ in one's environment without knowing its name/address. This has an
+ additional benefit that it allows one to change one's default fs
+ (e.g. admin moves application from cluster1 to cluster2).
+
+
+ To facilitate this, Hadoop supports a notion of a default file system.
+ The user can set his default file system, although this is
+ typically set up for you in your environment via your default config.
+ A default file system implies a default scheme and authority; slash-relative
+ names (such as /for/bar) are resolved relative to that default FS.
+ Similarly a user can also have working-directory-relative names (i.e. names
+ not starting with a slash). While the working directory is generally in the
+ same default FS, the wd can be in a different FS.
+
+ Hence Hadoop path names can be one of:
+
+
fully qualified URI: scheme://authority/path
+
slash relative names: /path relative to the default file system
+
wd-relative names: path relative to the working dir
+
+ Relative paths with scheme (scheme:foo/bar) are illegal.
+
+
+ ****The Role of the FileContext and configuration defaults****
+
+ The FileContext provides file namespace context for resolving file names;
+ it also contains the umask for permissions, In that sense it is like the
+ per-process file-related state in Unix system.
+ These two properties
+
+
default file system i.e your slash)
+
umask
+
+ in general, are obtained from the default configuration file
+ in your environment, (@see {@link Configuration}).
+
+ No other configuration parameters are obtained from the default config as
+ far as the file context layer is concerned. All file system instances
+ (i.e. deployments of file systems) have default properties; we call these
+ server side (SS) defaults. Operation like create allow one to select many
+ properties: either pass them in as explicit parameters or use
+ the SS properties.
+
+ The file system related SS defaults are
+
+
the home directory (default is "/user/userName")
+
the initial wd (only for local fs)
+
replication factor
+
block size
+
buffer size
+
bytesPerChecksum (if used).
+
+
+
+ *** Usage Model for the FileContext class ***
+
+ Example 1: use the default config read from the $HADOOP_CONFIG/core.xml.
+ Unspecified values come from core-defaults.xml in the release jar.
+
+
myFContext = FileContext.getFileContext(); // uses the default config
+ // which has your default FS
+
myFContext.create(path, ...);
+
myFContext.setWorkingDir(path)
+
myFContext.open (path, ...);
+
+ Example 2: Get a FileContext with a specific URI as the default FS
+
+
myFContext = FileContext.getFileContext(URI)
+
myFContext.create(path, ...);
+ ...
+
+ Example 3: FileContext with local file system as the default
+
+ Example 4: Use a specific config, ignoring $HADOOP_CONFIG
+ Generally you should not need use a config unless you are doing
+
+
configX = someConfigSomeOnePassedToYou.
+
myFContext = getFileContext(configX); // configX is not changed,
+ // is passed down
+
myFContext.create(path, ...);
+
...
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ path could
+ not be resolved
+ @throws IOException an I/O error occured]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ f is
+ not supported
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for
+ f is not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for
+ pathPattern is not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+
+ files does not
+ exist
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f is
+ not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f
+ is not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+ Return all the files that match filePattern and are not checksum
+ files. Results are sorted by their names.
+
+
+ A filename pattern is composed of regular characters and
+ special pattern matching characters, which are:
+
+
+
+
+
+
?
+
Matches any single character.
+
+
+
*
+
Matches zero or more characters.
+
+
+
[abc]
+
Matches a single character from character set
+ {a,b,c}.
+
+
+
[a-b]
+
Matches a single character from the character range
+ {a...b}. Note: character a must be
+ lexicographically less than or equal to character b.
+
+
+
[^a]
+
Matches a single char that is not from character set or range
+ {a}. Note that the ^ character must occur
+ immediately to the right of the opening bracket.
+
+
+
\c
+
Removes (escapes) any special meaning of character c.
+
+
+
{ab,cd}
+
Matches a string from the string set {ab, cd}
+
+
+
{ab,c{de,fh}}
+
Matches a string from string set {ab, cde, cfh}
+
+
+
+
+
+ @param pathPattern a regular expression specifying a pth pattern
+
+ @return an array of paths that match the path pattern
+
+ @throws AccessControlException If access is denied
+ @throws UnsupportedFileSystemException If file system for
+ pathPattern is not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+
+ pathPattern is not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ dst already exists
+ @throws FileNotFoundException If src does not exist
+ @throws ParentNotDirectoryException If parent of dst is not
+ a directory
+ @throws UnsupportedFileSystemException If file system for
+ src or dst is not supported
+ @throws IOException If an I/O error occurred
+
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server
+
+ RuntimeExceptions:
+ @throws InvalidPathException If path dst is invalid]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ fs.scheme.class whose value names the FileSystem class.
+ The entire URI is passed to the FileSystem instance's initialize method.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ fs.scheme.class whose value names the FileSystem class.
+ The entire URI is passed to the FileSystem instance's initialize method.
+ This always returns a new FileSystem object.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Fails if src is a file and dst is a directory.
+
Fails if src is a directory and dst is a file.
+
Fails if the parent of dst does not exist or is a file.
+
+
+ If OVERWRITE option is not passed as an argument, rename fails
+ if the dst already exists.
+
+ If OVERWRITE option is passed as an argument, rename overwrites
+ the dst if it is a file or an empty directory. Rename fails if dst is
+ a non-empty directory.
+
+ Note that atomicity of rename is dependent on the file system
+ implementation. Please refer to the file system documentation for
+ details. This default implementation is non atomic.
+
+ This method is deprecated since it is a temporary method added to
+ support the transition from FileSystem to FileContext for user
+ applications.
+
+ @param src path to be renamed
+ @param dst new path after rename
+ @throws IOException on failure]]>
+
+ A filename pattern is composed of regular characters and
+ special pattern matching characters, which are:
+
+
+
+
+
+
?
+
Matches any single character.
+
+
+
*
+
Matches zero or more characters.
+
+
+
[abc]
+
Matches a single character from character set
+ {a,b,c}.
+
+
+
[a-b]
+
Matches a single character from the character range
+ {a...b}. Note that character a must be
+ lexicographically less than or equal to character b.
+
+
+
[^a]
+
Matches a single character that is not from character set or range
+ {a}. Note that the ^ character must occur
+ immediately to the right of the opening bracket.
+
+
+
\c
+
Removes (escapes) any special meaning of character c.
+
+
+
{ab,cd}
+
Matches a string from the string set {ab, cd}
+
+
+
{ab,c{de,fh}}
+
Matches a string from the string set {ab, cde, cfh}
+
+
+
+
+
+ @param pathPattern a regular expression specifying a pth pattern
+
+ @return an array of paths that match the path pattern
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws IOException If an I/O error occurred]]>
+
+
+
+
+
+
+
+
+ f does not exist
+ @throws IOException if any I/O error occurred]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ All user code that may potentially use the Hadoop Distributed
+ File System should be written to use a FileSystem object. The
+ Hadoop DFS is a multi-machine system that appears as a single
+ disk. It's useful because of its fault tolerance and potentially
+ very large capacity.
+
+
+ The local implementation is {@link LocalFileSystem} and distributed
+ implementation is DistributedFileSystem.]]>
+
+
+This pages describes how to use Kosmos Filesystem
+( KFS ) as a backing
+store with Hadoop. This page assumes that you have downloaded the
+KFS software and installed necessary binaries as outlined in the KFS
+documentation.
+
+
Steps
+
+
+
In the Hadoop conf directory edit core-site.xml,
+ add the following:
+
In the Hadoop conf directory edit core-site.xml,
+ adding the following (with appropriate values for
+ <server> and <port>):
+
+<property>
+ <name>fs.default.name</name>
+ <value>kfs://<server:port></value>
+</property>
+
+<property>
+ <name>fs.kfs.metaServerHost</name>
+ <value><server></value>
+ <description>The location of the KFS meta server.</description>
+</property>
+
+<property>
+ <name>fs.kfs.metaServerPort</name>
+ <value><port></value>
+ <description>The location of the meta server's port.</description>
+</property>
+
+
+
+
+
Copy KFS's kfs-0.1.jar to Hadoop's lib directory. This step
+ enables Hadoop's to load the KFS specific modules. Note
+ that, kfs-0.1.jar was built when you compiled KFS source
+ code. This jar file contains code that calls KFS's client
+ library code via JNI; the native code is in KFS's
+ libkfsClient.so library.
+
+
+
When the Hadoop map/reduce trackers start up, those
+processes (on local as well as remote nodes) will now need to load
+KFS's libkfsClient.so library. To simplify this process, it is advisable to
+store libkfsClient.so in an NFS accessible directory (similar to where
+Hadoop binaries/scripts are stored); then, modify Hadoop's
+conf/hadoop-env.sh adding the following line and providing suitable
+value for <path>:
+
+export LD_LIBRARY_PATH=<path>
+
+
+
+
Start only the map/reduce trackers
+
+ example: execute Hadoop's bin/start-mapred.sh
+
+
+
+If the map/reduce job trackers start up, all file-I/O is done to KFS.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ (cause==null ? null : cause.toString()) (which
+ typically contains the class and detail message of cause).
+ @param cause the cause (which is saved for later retrieval by the
+ {@link #getCause()} method). (A null value is
+ permitted, and indicates that the cause is nonexistent or
+ unknown.)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ mode is invalid]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class is a tool for migrating data from an older to a newer version
+ of an S3 filesystem.
+
+
+ All files in the filesystem are migrated by re-writing the block metadata
+ - no datafiles are touched.
+
+Files are stored in S3 as blocks (represented by
+{@link org.apache.hadoop.fs.s3.Block}), which have an ID and a length.
+Block metadata is stored in S3 as a small record (represented by
+{@link org.apache.hadoop.fs.s3.INode}) using the URL-encoded
+path string as a key. Inodes record the file type (regular file or directory) and the list of blocks.
+This design makes it easy to seek to any given position in a file by reading the inode data to compute
+which block to access, then using S3's support for
+HTTP Range headers
+to start streaming from the correct position.
+Renames are also efficient since only the inode is moved (by a DELETE followed by a PUT since
+S3 does not support renames).
+
+
+For a single file /dir1/file1 which takes two blocks of storage, the file structure in S3
+would be something like this:
+
+Inodes start with a leading /, while blocks are prefixed with block-.
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ If f is a file, this method will make a single call to S3.
+ If f is a directory, this method will make a maximum of
+ (n / 1000) + 2 calls to S3, where n is the total number of
+ files and directories contained directly in f.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A {@link FileSystem} for reading and writing files stored on
+ Amazon S3.
+ Unlike {@link org.apache.hadoop.fs.s3.S3FileSystem} this implementation
+ stores files on S3 in their
+ native form so they can be read by other S3 tools.
+
+ A note about directories. S3 of course has no "native" support for them.
+ The idiom we choose then is: for any directory created by this class,
+ we use an empty object "#{dirpath}_$folder$" as a marker.
+ Further, to interoperate with other S3 tools, we also accept the following:
+ - an object "#{dirpath}/' denoting a directory marker
+ - if there exists any objects with the prefix "#{dirpath}/", then the
+ directory is said to exist
+ - if both a file with the name of a directory and a marker for that
+ directory exists, then the *file masks the directory*, and the directory
+ is never returned.
+
+ @see org.apache.hadoop.fs.s3.S3FileSystem]]>
+
+
+
+
+
+A distributed implementation of {@link
+org.apache.hadoop.fs.FileSystem} for reading and writing files on
+Amazon S3.
+Unlike {@link org.apache.hadoop.fs.s3.S3FileSystem}, which is block-based,
+this implementation stores
+files on S3 in their native form for interoperability with other S3 tools.
+]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ nth value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ nth value in the file.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ public class IntArrayWritable extends ArrayWritable {
+ public IntArrayWritable() {
+ super(IntWritable.class);
+ }
+ }
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a ByteWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to store
+ @param item the object to be stored
+ @param keyName the name of the key to use
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param keyName the name of the key to use
+ @param itemClass the class of the item
+ @return restored object
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param items the objects to be stored
+ @param keyName the name of the key to use
+ @throws IndexOutOfBoundsException if the items array is empty
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param keyName the name of the key to use
+ @param itemClass the class of the item
+ @return restored object
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+ DefaultStringifier offers convenience methods to store/load objects to/from
+ the configuration.
+
+ @param the class of the objects to stringify]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a DoubleWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ value argument is null or
+ its size is zero, the elementType argument must not be null. If
+ the argument value's size is bigger than zero, the argument
+ elementType is not be used.
+
+ @param value
+ @param elementType]]>
+
+
+
+
+ value should not be null
+ or empty.
+
+ @param value]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ value and elementType. If the value argument
+ is null or its size is zero, the elementType argument must not be
+ null. If the argument value's size is bigger than zero, the
+ argument elementType is not be used.
+
+ @param value
+ @param elementType]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is an EnumSetWritable with the same value,
+ or both are null.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a FloatWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ When two sequence files, which have same Key type but different Value
+ types, are mapped out to reduce, multiple Value types is not allowed.
+ In this case, this class can help you wrap instances with different types.
+
+
+
+ Compared with ObjectWritable, this class is much more effective,
+ because ObjectWritable will append the class declaration as a String
+ into the output file in every Key-Value pair.
+
+
+
+ Generic Writable implements {@link Configurable} interface, so that it will be
+ configured by the framework. The configuration is passed to the wrapped objects
+ implementing {@link Configurable} interface before deserialization.
+
+
+ how to use it:
+ 1. Write your own class, such as GenericObject, which extends GenericWritable.
+ 2. Implements the abstract method getTypes(), defines
+ the classes which will be wrapped in GenericObject in application.
+ Attention: this classes defined in getTypes() method, must
+ implement Writable interface.
+
+
+ @since Nov 8, 2006]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a IntWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ closes the input and output streams
+ at the end.
+ @param in InputStrem to read from
+ @param out OutputStream to write to
+ @param conf the Configuration object]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ignore any {@link IOException} or
+ null pointers. Must only be used for cleanup in exception handlers.
+ @param log the log to record problems to at debug level. Can be null.
+ @param closeables the objects to close]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a LongWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A map is a directory containing two files, the data file,
+ containing all keys and values in the map, and a smaller index
+ file, containing a fraction of the keys. The fraction is determined by
+ {@link Writer#getIndexInterval()}.
+
+
The index file is read entirely into memory. Thus key implementations
+ should try to keep themselves small.
+
+
Map files are created by adding entries in-order. To maintain a large
+ database, perform updates by copying the previous version of a database and
+ merging in a sorted change list, to create a new version of the database in
+ a new file. Sorting large change lists can be done with {@link
+ SequenceFile.Sorter}.]]>
+
Malicious user removes his task's syslog file, and puts a link to the
+ jobToken file of a target user.
+
Malicious user tries to open the syslog file via the servlet on the
+ tasktracker.
+
The tasktracker is unaware of the symlink, and simply streams the contents
+ of the jobToken file. The malicious user can now access potentially sensitive
+ map outputs, etc. of the target user's job.
SequenceFile provides {@link Writer}, {@link Reader} and
+ {@link Sorter} classes for writing, reading and sorting respectively.
+
+ There are three SequenceFileWriters based on the
+ {@link CompressionType} used to compress key/value pairs:
+
+
+ Writer : Uncompressed records.
+
+
+ RecordCompressWriter : Record-compressed files, only compress
+ values.
+
+
+ BlockCompressWriter : Block-compressed files, both keys &
+ values are collected in 'blocks'
+ separately and compressed. The size of
+ the 'block' is configurable.
+
+
+
The actual compression algorithm used to compress key and/or values can be
+ specified by using the appropriate {@link CompressionCodec}.
+
+
The recommended way is to use the static createWriter methods
+ provided by the SequenceFile to chose the preferred format.
+
+
The {@link Reader} acts as the bridge and can read any of the above
+ SequenceFile formats.
+
+
SequenceFile Formats
+
+
Essentially there are 3 different formats for SequenceFiles
+ depending on the CompressionType specified. All of them share a
+ common header described below.
+
+
SequenceFile Header
+
+
+ version - 3 bytes of magic header SEQ, followed by 1 byte of actual
+ version number (e.g. SEQ4 or SEQ6)
+
+
+ keyClassName -key class
+
+
+ valueClassName - value class
+
+
+ compression - A boolean which specifies if compression is turned on for
+ keys/values in this file.
+
+
+ blockCompression - A boolean which specifies if block-compression is
+ turned on for keys/values in this file.
+
+
+ compression codec - CompressionCodec class which is used for
+ compression of keys and/or values (if compression is
+ enabled).
+
+
+ metadata - {@link Metadata} for this file.
+
+
+ sync - A sync marker to denote end of the header.
+
The compressed blocks of key lengths and value lengths consist of the
+ actual lengths of individual keys/values encoded in ZeroCompressedInteger
+ format.
+
+ @see CompressionCodec]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ = 0. Otherwise,
+ the length is not available.
+ @return The opened stream.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ key, skipping its
+ value. True if another entry exists, and false at end of file.]]>
+
+
+
+
+
+
+
+ key and
+ val. Returns true if such a pair exists and false when at
+ end of file]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The position passed must be a position returned by {@link
+ SequenceFile.Writer#getLength()} when writing this file. To seek to an arbitrary
+ position, use {@link SequenceFile.Reader#sync(long)}.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ SegmentDescriptor
+ @param segments the list of SegmentDescriptors
+ @param tmpDir the directory to write temporary files into
+ @return RawKeyValueIterator
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ For best performance, applications should make sure that the {@link
+ Writable#readFields(DataInput)} implementation of their keys is
+ very efficient. In particular, it should avoid allocating memory.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This always returns a synchronized position. In other words,
+ immediately after calling {@link SequenceFile.Reader#seek(long)} with a position
+ returned by this method, {@link SequenceFile.Reader#next(Writable)} may be called. However
+ the key may be earlier in the file than key last written when this
+ method was called (e.g., with block-compression, it may be the first key
+ in the block that was being written when this method was called).]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ key. Returns
+ true if such a key exists and false when at the end of the set.]]>
+
+
+
+
+
+
+ key.
+ Returns key, or null if no match exists.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the class of the objects to stringify]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ position. Note that this
+ method avoids using the converter or doing String instatiation
+ @return the Unicode scalar value at position or -1
+ if the position is invalid or points to a
+ trailing byte]]>
+
+
+
+
+
+
+
+
+
+ what in the backing
+ buffer, starting as position start. The starting
+ position is measured in bytes and the return value is in
+ terms of byte position in the buffer. The backing buffer is
+ not converted to a string for this operation.
+ @return byte position of the first occurence of the search
+ string in the UTF-8 buffer or -1 if not found]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a Text with the same contents.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ replace is true, then
+ malformed input is replaced with the
+ substitution character, which is U+FFFD. Otherwise the
+ method throws a MalformedInputException.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ replace is true, then
+ malformed input is replaced with the
+ substitution character, which is U+FFFD. Otherwise the
+ method throws a MalformedInputException.
+ @return ByteBuffer: bytes stores at ByteBuffer.array()
+ and length is ByteBuffer.limit()]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ In
+ addition, it provides methods for string traversal without converting the
+ byte array to a string.
Also includes utilities for
+ serializing/deserialing a string, coding/decoding a string, checking if a
+ byte array contains valid UTF8 code, calculating the length of an encoded
+ string.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This is useful when a class may evolve, so that instances written by the
+ old version of the class may still be processed by the new version. To
+ handle this situation, {@link #readFields(DataInput)}
+ implementations should catch {@link VersionMismatchException}.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a VIntWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a VLongWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ out.
+
+ @param out DataOuput to serialize this object into.
+ @throws IOException]]>
+
+
+
+
+
+
+ in.
+
+
For efficiency, implementations should attempt to re-use storage in the
+ existing object where possible.
+
+ @param in DataInput to deseriablize this object from.
+ @throws IOException]]>
+
+
+
+ Any key or value type in the Hadoop Map-Reduce
+ framework implements this interface.
+
+
Implementations typically implement a static read(DataInput)
+ method which constructs a new instance, calls {@link #readFields(DataInput)}
+ and returns the instance.
+
+
Example:
+
+ public class MyWritable implements Writable {
+ // Some data
+ private int counter;
+ private long timestamp;
+
+ public void write(DataOutput out) throws IOException {
+ out.writeInt(counter);
+ out.writeLong(timestamp);
+ }
+
+ public void readFields(DataInput in) throws IOException {
+ counter = in.readInt();
+ timestamp = in.readLong();
+ }
+
+ public static MyWritable read(DataInput in) throws IOException {
+ MyWritable w = new MyWritable();
+ w.readFields(in);
+ return w;
+ }
+ }
+
]]>
+
+
+
+
+
+
+
+
+ WritableComparables can be compared to each other, typically
+ via Comparators. Any type which is to be used as a
+ key in the Hadoop Map-Reduce framework should implement this
+ interface.
+
+
Example:
+
+ public class MyWritableComparable implements
+ WritableComparable<MyWritableComparable> {
+
+ // Some data
+ private int counter;
+ private long timestamp;
+
+ public void write(DataOutput out) throws IOException {
+ out.writeInt(counter);
+ out.writeLong(timestamp);
+ }
+
+ public void readFields(DataInput in) throws IOException {
+ counter = in.readInt();
+ timestamp = in.readLong();
+ }
+
+ public int compareTo(MyWritableComparable other) {
+ int thisValue = this.counter;
+ int thatValue = other.counter;
+ return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
+ }
+ }
+
One may optimize compare-intensive operations by overriding
+ {@link #compare(byte[],int,int,byte[],int,int)}. Static utility methods are
+ provided to assist in optimized implementations of this method.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Enum type
+ @param in DataInput to read from
+ @param enumType Class type of Enum
+ @return Enum represented by String read from DataInput
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ len number of bytes in input streamin
+ @param in input stream
+ @param len number of bytes to skip
+ @throws IOException when skipped less number of bytes]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ CompressionCodec for which to get the
+ Compressor
+ @param conf the Configuration object which contains confs for creating or reinit the compressor
+ @return Compressor for the given
+ CompressionCodec from the pool or a new one]]>
+
+
+
+
+
+
+
+
+ CompressionCodec for which to get the
+ Decompressor
+ @return Decompressor for the given
+ CompressionCodec the pool or a new one]]>
+
+
+
+
+
+ Compressor to be returned to the pool]]>
+
+
+
+
+
+ Decompressor to be returned to the
+ pool]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Implementations are assumed to be buffered. This permits clients to
+ reposition the underlying input stream then call {@link #resetState()},
+ without having to also synchronize client buffers.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true indicating that more input data is required.
+
+ @param b Input data
+ @param off Start offset
+ @param len Length]]>
+
+
+
+
+ true if the input data buffer is empty and
+ #setInput() should be called in order to provide more input.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the end of the compressed
+ data output stream has been reached.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true indicating that more input data is required.
+ (Both native and non-native versions of various Decompressors require
+ that the data passed in via b[] remain unmodified until
+ the caller is explicitly notified--via {@link #needsInput()}--that the
+ buffer may be safely modified. With this requirement, an extra
+ buffer-copy can be avoided.)
+
+ @param b Input data
+ @param off Start offset
+ @param len Length]]>
+
+
+
+
+ true if the input data buffer is empty and
+ {@link #setInput(byte[], int, int)} should be called in
+ order to provide more input.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ true if a preset dictionary is needed for decompression.
+ @return true if a preset dictionary is needed for decompression]]>
+
+
+
+
+ true if the end of the decompressed
+ data output stream has been reached.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Seek by key or by file offset.
+
+ The memory footprint of a TFile includes the following:
+
+
Some constant overhead of reading or writing a compressed block.
+
+
Each compressed block requires one compression/decompression codec for
+ I/O.
+
Temporary space to buffer the key.
+
Temporary space to buffer the value (for TFile.Writer only). Values are
+ chunk encoded, so that we buffer at most one chunk of user data. By default,
+ the chunk buffer is 1MB. Reading chunked value does not require additional
+ memory.
+
+
TFile index, which is proportional to the total number of Data Blocks.
+ The total amount of memory needed to hold the index can be estimated as
+ (56+AvgKeySize)*NumBlocks.
+
MetaBlock index, which is proportional to the total number of Meta
+ Blocks.The total amount of memory needed to hold the index for Meta Blocks
+ can be estimated as (40+AvgMetaBlockName)*NumMetaBlock.
+
+
+ The behavior of TFile can be customized by the following variables through
+ Configuration:
+
+
tfile.io.chunk.size: Value chunk size. Integer (in bytes). Default
+ to 1MB. Values of the length less than the chunk size is guaranteed to have
+ known value length in read time (See
+ {@link TFile.Reader.Scanner.Entry#isValueLengthKnown()}).
+
tfile.fs.output.buffer.size: Buffer size used for
+ FSDataOutputStream. Integer (in bytes). Default to 256KB.
+
tfile.fs.input.buffer.size: Buffer size used for
+ FSDataInputStream. Integer (in bytes). Default to 256KB.
+
+
+ Suggestions on performance optimization.
+
+
Minimum block size. We recommend a setting of minimum block size between
+ 256KB to 1MB for general usage. Larger block size is preferred if files are
+ primarily for sequential access. However, it would lead to inefficient random
+ access (because there are more data to decompress). Smaller blocks are good
+ for random access, but require more memory to hold the block index, and may
+ be slower to create (because we must flush the compressor stream at the
+ conclusion of each data block, which leads to an FS I/O flush). Further, due
+ to the internal caching in Compression codec, the smallest possible block
+ size would be around 20KB-30KB.
+
The current implementation does not offer true multi-threading for
+ reading. The implementation uses FSDataInputStream seek()+read(), which is
+ shown to be much faster than positioned-read call in single thread mode.
+ However, it also means that if multiple threads attempt to access the same
+ TFile (using multiple scanners) simultaneously, the actual I/O is carried out
+ sequentially even if they access different DFS blocks.
+
Compression codec. Use "none" if the data is not very compressable (by
+ compressable, I mean a compression ratio at least 2:1). Generally, use "lzo"
+ as the starting point for experimenting. "gz" overs slightly better
+ compression ratio over "lzo" but requires 4x CPU to compress and 2x CPU to
+ decompress, comparing to "lzo".
+
File system buffering, if the underlying FSDataInputStream and
+ FSDataOutputStream is already adequately buffered; or if applications
+ reads/writes keys and values in large buffers, we can reduce the sizes of
+ input/output buffering in TFile layer by setting the configuration parameters
+ "tfile.fs.input.buffer.size" and "tfile.fs.output.buffer.size".
+
+
+ Some design rationale behind TFile can be found at Hadoop-3315.]]>
+
+ Use {@link Scanner#advance()} to move the cursor to the next key-value
+ pair (or end if none exists). Use seekTo methods (
+ {@link Scanner#seekTo(byte[])} or
+ {@link Scanner#seekTo(byte[], int, int)}) to seek to any arbitrary
+ location in the covered range (including backward seeking). Use
+ {@link Scanner#rewind()} to seek back to the beginning of the scanner.
+ Use {@link Scanner#seekToEnd()} to seek to the end of the scanner.
+
+ Actual keys and values may be obtained through {@link Scanner.Entry}
+ object, which is obtained through {@link Scanner#entry()}.]]>
+
Algorithmic comparator: binary comparators that is language
+ independent. Currently, only "memcmp" is supported.
+
Language-specific comparator: binary comparators that can
+ only be constructed in specific language. For Java, the syntax
+ is "jclass:", followed by the class name of the RawComparator.
+ Currently, we only support RawComparators that can be
+ constructed through the default constructor (with no
+ parameters). Parameterized RawComparators such as
+ {@link WritableComparator} or
+ {@link JavaSerializationComparator} may not be directly used.
+ One should write a wrapper class that inherits from such classes
+ and use its default constructor to perform proper
+ initialization.
+
+ @param conf
+ The configuration object.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ If an exception is thrown, the TFile will be in an inconsistent
+ state. The only legitimate call after that would be close]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Utils#writeVLong(out, n).
+
+ @param out
+ output stream
+ @param n
+ The integer to be encoded
+ @throws IOException
+ @see Utils#writeVLong(DataOutput, long)]]>
+
+
+
+
+
+
+
+
+
if n in [-32, 127): encode in one byte with the actual value.
+ Otherwise,
+
if n in [-20*2^8, 20*2^8): encode in two bytes: byte[0] = n/256 - 52;
+ byte[1]=n&0xff. Otherwise,
+
if n IN [-16*2^16, 16*2^16): encode in three bytes: byte[0]=n/2^16 -
+ 88; byte[1]=(n>>8)&0xff; byte[2]=n&0xff. Otherwise,
+
if n in [-8*2^24, 8*2^24): encode in four bytes: byte[0]=n/2^24 - 112;
+ byte[1] = (n>>16)&0xff; byte[2] = (n>>8)&0xff; byte[3]=n&0xff. Otherwise:
+
if n in [-2^31, 2^31): encode in five bytes: byte[0]=-125; byte[1] =
+ (n>>24)&0xff; byte[2]=(n>>16)&0xff; byte[3]=(n>>8)&0xff; byte[4]=n&0xff;
+
if n in [-2^39, 2^39): encode in six bytes: byte[0]=-124; byte[1] =
+ (n>>32)&0xff; byte[2]=(n>>24)&0xff; byte[3]=(n>>16)&0xff;
+ byte[4]=(n>>8)&0xff; byte[5]=n&0xff
+
if n in [-2^47, 2^47): encode in seven bytes: byte[0]=-123; byte[1] =
+ (n>>40)&0xff; byte[2]=(n>>32)&0xff; byte[3]=(n>>24)&0xff;
+ byte[4]=(n>>16)&0xff; byte[5]=(n>>8)&0xff; byte[6]=n&0xff;
+
if n in [-2^55, 2^55): encode in eight bytes: byte[0]=-122; byte[1] =
+ (n>>48)&0xff; byte[2] = (n>>40)&0xff; byte[3]=(n>>32)&0xff;
+ byte[4]=(n>>24)&0xff; byte[5]=(n>>16)&0xff; byte[6]=(n>>8)&0xff;
+ byte[7]=n&0xff;
+
if n in [-2^63, 2^63): encode in nine bytes: byte[0]=-121; byte[1] =
+ (n>>54)&0xff; byte[2] = (n>>48)&0xff; byte[3] = (n>>40)&0xff;
+ byte[4]=(n>>32)&0xff; byte[5]=(n>>24)&0xff; byte[6]=(n>>16)&0xff;
+ byte[7]=(n>>8)&0xff; byte[8]=n&0xff;
+
+
+ @param out
+ output stream
+ @param n
+ the integer number
+ @throws IOException]]>
+
if (FB in [-72, -33]), return (FB+52)<<8 + NB[0]&0xff;
+
if (FB in [-104, -73]), return (FB+88)<<16 + (NB[0]&0xff)<<8 +
+ NB[1]&0xff;
+
if (FB in [-120, -105]), return (FB+112)<<24 + (NB[0]&0xff)<<16 +
+ (NB[1]&0xff)<<8 + NB[2]&0xff;
+
if (FB in [-128, -121]), return interpret NB[FB+129] as a signed
+ big-endian integer.
+
+ @param in
+ input stream
+ @return the decoded long integer.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Type of the input key.
+ @param list
+ The list
+ @param key
+ The input key.
+ @param cmp
+ Comparator for the key.
+ @return The index to the desired element if it exists; or list.size()
+ otherwise.]]>
+
+
+
+
+
+
+
+
+ Type of the input key.
+ @param list
+ The list
+ @param key
+ The input key.
+ @param cmp
+ Comparator for the key.
+ @return The index to the desired element if it exists; or list.size()
+ otherwise.]]>
+
+
+
+
+
+
+
+ Type of the input key.
+ @param list
+ The list
+ @param key
+ The input key.
+ @return The index to the desired element if it exists; or list.size()
+ otherwise.]]>
+
+
+
+
+
+
+
+ Type of the input key.
+ @param list
+ The list
+ @param key
+ The input key.
+ @return The index to the desired element if it exists; or list.size()
+ otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ errno result codes.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ An experimental {@link Serialization} for Java {@link Serializable} classes.
+
+ @see JavaSerializationComparator]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A {@link RawComparator} that uses a {@link JavaSerialization}
+ {@link Deserializer} to deserialize objects that are then compared via
+ their {@link Comparable} interfaces.
+
+ @param
+ @see JavaSerialization]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+This package provides a mechanism for using different serialization frameworks
+in Hadoop. The property "io.serializations" defines a list of
+{@link org.apache.hadoop.io.serializer.Serialization}s that know how to create
+{@link org.apache.hadoop.io.serializer.Serializer}s and
+{@link org.apache.hadoop.io.serializer.Deserializer}s.
+
+
+
+To add a new serialization framework write an implementation of
+{@link org.apache.hadoop.io.serializer.Serialization} and add its name to the
+"io.serializations" property.
+
+Use {@link org.apache.hadoop.io.serializer.avro.AvroSpecificSerialization} for
+serialization of classes generated by Avro's 'specific' compiler.
+
+
+
+Use {@link org.apache.hadoop.io.serializer.avro.AvroReflectSerialization} for
+other classes.
+{@link org.apache.hadoop.io.serializer.avro.AvroReflectSerialization} work for
+any class which is either in the package list configured via
+{@link org.apache.hadoop.io.serializer.avro.AvroReflectSerialization#AVRO_REFLECT_PACKAGES}
+or implement {@link org.apache.hadoop.io.serializer.avro.AvroReflectSerializable}
+interface.
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+The API is abstract so that it can be implemented on top of
+a variety of metrics client libraries. The choice of
+client library is a configuration option, and different
+modules within the same application can use
+different metrics implementation libraries.
+
+Sub-packages:
+
+
org.apache.hadoop.metrics.spi
+
The abstract Server Provider Interface package. Those wishing to
+ integrate the metrics API with a particular metrics client library should
+ extend this package.
+
+
org.apache.hadoop.metrics.file
+
An implementation package which writes the metric data to
+ a file, or sends it to the standard output stream.
+
+
org.apache.hadoop.metrics.ganglia
+
An implementation package which sends metric data to
+ Ganglia.
+
+
+
Introduction to the Metrics API
+
+Here is a simple example of how to use this package to report a single
+metric value:
+
The context name will typically identify either the application, or else a
+ module within an application or library.
+
+
myRecord
+
The record name generally identifies some entity for which a set of
+ metrics are to be reported. For example, you could have a record named
+ "cacheStats" for reporting a number of statistics relating to the usage of
+ some cache in your application.
+
+
myMetric
+
This identifies a particular metric. For example, you might have metrics
+ named "cache_hits" and "cache_misses".
+
+
+
+
Tags
+
+In some cases it is useful to have multiple records with the same name. For
+example, suppose that you want to report statistics about each disk on a computer.
+In this case, the record name would be something like "diskStats", but you also
+need to identify the disk which is done by adding a tag to the record.
+The code could look something like this:
+
+
+Data is not sent immediately to the metrics system when
+MetricsRecord.update() is called. Instead it is stored in an
+internal table, and the contents of the table are sent periodically.
+This can be important for two reasons:
+
+
It means that a programmer is free to put calls to this API in an
+ inner loop, since updates can be very frequent without slowing down
+ the application significantly.
+
Some implementations can gain efficiency by combining many metrics
+ into a single UDP message.
+
+
+The API provides a timer-based callback via the
+registerUpdater() method. The benefit of this
+versus using java.util.Timer is that the callbacks will be done
+immediately before sending the data, making the data as current as possible.
+
+
Configuration
+
+It is possible to programmatically examine and modify configuration data
+before creating a context, like this:
+
+The factory attributes can be examined and modified using the following
+ContextFactorymethods:
+
+
Object getAttribute(String attributeName)
+
String[] getAttributeNames()
+
void setAttribute(String name, Object value)
+
void removeAttribute(attributeName)
+
+
+
+ContextFactory.getFactory() initializes the factory attributes by
+reading the properties file hadoop-metrics.properties if it exists
+on the class path.
+
+
+A factory attribute named:
+
+contextName.class
+
+should have as its value the fully qualified name of the class to be
+instantiated by a call of the CodeFactory method
+getContext(contextName). If this factory attribute is not
+specified, the default is to instantiate
+org.apache.hadoop.metrics.file.FileContext.
+
+
+Other factory attributes are specific to a particular implementation of this
+API and are documented elsewhere. For example, configuration attributes for
+the file and Ganglia implementations can be found in the javadoc for
+their respective packages.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ fileName attribute,
+ if specified. Otherwise the data will be written to standard
+ output.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class is configured by setting ContextFactory attributes which in turn
+ are usually configured through a properties file. All the attributes are
+ prefixed by the contextName. For example, the properties file might contain:
+
]]>
+
+
+
+
+
+These are the implementation specific factory attributes
+(See ContextFactory.getFactory()):
+
+
+
contextName.fileName
+
The path of the file to which metrics in context contextName
+ are to be appended. If this attribute is not specified, the metrics
+ are written to standard output by default.
+
+
contextName.period
+
The period in seconds on which the metric data is written to the
+ file.
+
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Implementation of the metrics package that sends metric data to
+Ganglia.
+Programmers should not normally need to use this package directly. Instead
+they should use org.hadoop.metrics.
+
+
+These are the implementation specific factory attributes
+(See ContextFactory.getFactory()):
+
+
+
contextName.servers
+
Space and/or comma separated sequence of servers to which UDP
+ messages should be sent.
+
+
contextName.period
+
The period in seconds on which the metric data is sent to the
+ server(s).
+
+
contextName.units.recordName.metricName
+
The units for the specified metric in the specified record.
+
+
contextName.slope.recordName.metricName
+
The slope for the specified metric in the specified record.
+
+
contextName.tmax.recordName.metricName
+
The tmax for the specified metric in the specified record.
+
+
contextName.dmax.recordName.metricName
+
The dmax for the specified metric in the specified record.
+
+ Software systems of any significant complexity require mechanisms for data
+interchange with the outside world. These interchanges typically involve the
+marshaling and unmarshaling of logical units of data to and from data streams
+(files, network connections, memory buffers etc.). Applications usually have
+some code for serializing and deserializing the data types that they manipulate
+embedded in them. The work of serialization has several features that make
+automatic code generation for it worthwhile. Given a particular output encoding
+(binary, XML, etc.), serialization of primitive types and simple compositions
+of primitives (structs, vectors etc.) is a very mechanical task. Manually
+written serialization code can be susceptible to bugs especially when records
+have a large number of fields or a record definition changes between software
+versions. Lastly, it can be very useful for applications written in different
+programming languages to be able to share and interchange data. This can be
+made a lot easier by describing the data records manipulated by these
+applications in a language agnostic manner and using the descriptions to derive
+implementations of serialization in multiple target languages.
+
+This document describes Hadoop Record I/O, a mechanism that is aimed
+at
+
+
enabling the specification of simple serializable data types (records)
+
enabling the generation of code in multiple target languages for
+marshaling and unmarshaling such types
+
providing target language specific support that will enable application
+programmers to incorporate generated code into their applications
+
+
+The goals of Hadoop Record I/O are similar to those of mechanisms such as XDR,
+ASN.1, PADS and ICE. While these systems all include a DDL that enables
+the specification of most record types, they differ widely in what else they
+focus on. The focus in Hadoop Record I/O is on data marshaling and
+multi-lingual support. We take a translator-based approach to serialization.
+Hadoop users have to describe their data in a simple data description
+language. The Hadoop DDL translator rcc generates code that users
+can invoke in order to read/write their data from/to simple stream
+abstractions. Next we list explicitly some of the goals and non-goals of
+Hadoop Record I/O.
+
+
+
Goals
+
+
+
Support for commonly used primitive types. Hadoop should include as
+primitives commonly used builtin types from programming languages we intend to
+support.
+
+
Support for common data compositions (including recursive compositions).
+Hadoop should support widely used composite types such as structs and
+vectors.
+
+
Code generation in multiple target languages. Hadoop should be capable of
+generating serialization code in multiple target languages and should be
+easily extensible to new target languages. The initial target languages are
+C++ and Java.
+
+
Support for generated target languages. Hadooop should include support
+in the form of headers, libraries, packages for supported target languages
+that enable easy inclusion and use of generated code in applications.
+
+
Support for multiple output encodings. Candidates include
+packed binary, comma-separated text, XML etc.
+
+
Support for specifying record types in a backwards/forwards compatible
+manner. This will probably be in the form of support for optional fields in
+records. This version of the document does not include a description of the
+planned mechanism, we intend to include it in the next iteration.
+
+
+
+
Non-Goals
+
+
+
Serializing existing arbitrary C++ classes.
+
Serializing complex data structures such as trees, linked lists etc.
+
Built-in indexing schemes, compression, or check-sums.
+
Dynamic construction of objects from an XML schema.
+
+
+The remainder of this document describes the features of Hadoop record I/O
+in more detail. Section 2 describes the data types supported by the system.
+Section 3 lays out the DDL syntax with some examples of simple records.
+Section 4 describes the process of code generation with rcc. Section 5
+describes target language mappings and support for Hadoop types. We include a
+fairly complete description of C++ mappings with intent to include Java and
+others in upcoming iterations of this document. The last section talks about
+supported output encodings.
+
+
+
Data Types and Streams
+
+This section describes the primitive and composite types supported by Hadoop.
+We aim to support a set of types that can be used to simply and efficiently
+express a wide range of record types in different programming languages.
+
+
Primitive Types
+
+For the most part, the primitive types of Hadoop map directly to primitive
+types in high level programming languages. Special cases are the
+ustring (a Unicode string) and buffer types, which we believe
+find wide use and which are usually implemented in library code and not
+available as language built-ins. Hadoop also supplies these via library code
+when a target language built-in is not present and there is no widely
+adopted "standard" implementation. The complete list of primitive types is:
+
+
+
byte: An 8-bit unsigned integer.
+
boolean: A boolean value.
+
int: A 32-bit signed integer.
+
long: A 64-bit signed integer.
+
float: A single precision floating point number as described by
+ IEEE-754.
+
double: A double precision floating point number as described by
+ IEEE-754.
+
ustring: A string consisting of Unicode characters.
+
buffer: An arbitrary sequence of bytes.
+
+
+
+
Composite Types
+Hadoop supports a small set of composite types that enable the description
+of simple aggregate types and containers. A composite type is serialized
+by sequentially serializing it constituent elements. The supported
+composite types are:
+
+
+
+
record: An aggregate type like a C-struct. This is a list of
+typed fields that are together considered a single unit of data. A record
+is serialized by sequentially serializing its constituent fields. In addition
+to serialization a record has comparison operations (equality and less-than)
+implemented for it, these are defined as memberwise comparisons.
+
+
vector: A sequence of entries of the same data type, primitive
+or composite.
+
+
map: An associative container mapping instances of a key type to
+instances of a value type. The key and value types may themselves be primitive
+or composite types.
+
+
+
+
Streams
+
+Hadoop generates code for serializing and deserializing record types to
+abstract streams. For each target language Hadoop defines very simple input
+and output stream interfaces. Application writers can usually develop
+concrete implementations of these by putting a one method wrapper around
+an existing stream implementation.
+
+
+
DDL Syntax and Examples
+
+We now describe the syntax of the Hadoop data description language. This is
+followed by a few examples of DDL usage.
+
+
+
+A DDL file describes one or more record types. It begins with zero or
+more include declarations, a single mandatory module declaration
+followed by zero or more class declarations. The semantics of each of
+these declarations are described below:
+
+
+
+
include: An include declaration specifies a DDL file to be
+referenced when generating code for types in the current DDL file. Record types
+in the current compilation unit may refer to types in all included files.
+File inclusion is recursive. An include does not trigger code
+generation for the referenced file.
+
+
module: Every Hadoop DDL file must have a single module
+declaration that follows the list of includes and precedes all record
+declarations. A module declaration identifies a scope within which
+the names of all types in the current file are visible. Module names are
+mapped to C++ namespaces, Java packages etc. in generated code.
+
+
class: Records types are specified through class
+declarations. A class declaration is like a Java class declaration.
+It specifies a named record type and a list of fields that constitute records
+of the type. Usage is illustrated in the following examples.
+
+
+
+
Examples
+
+
+
A simple DDL file links.jr with just one record declaration.
+
+module links {
+ class Link {
+ ustring URL;
+ boolean isRelative;
+ ustring anchorText;
+ };
+}
+
+
+The Hadoop translator is written in Java. Invocation is done by executing a
+wrapper shell script named named rcc. It takes a list of
+record description files as a mandatory argument and an
+optional language argument (the default is Java) --language or
+-l. Thus a typical invocation would look like:
+
+$ rcc -l C++ ...
+
+
+
+
Target Language Mappings and Support
+
+For all target languages, the unit of code generation is a record type.
+For each record type, Hadoop generates code for serialization and
+deserialization, record comparison and access to record members.
+
+
C++
+
+Support for including Hadoop generated C++ code in applications comes in the
+form of a header file recordio.hh which needs to be included in source
+that uses Hadoop types and a library librecordio.a which applications need
+to be linked with. The header declares the Hadoop C++ namespace which defines
+appropriate types for the various primitives, the basic interfaces for
+records and streams and enumerates the supported serialization encodings.
+Declarations of these interfaces and a description of their semantics follow:
+
+
RecFormat: An enumeration of the serialization encodings supported
+by this implementation of Hadoop.
+
+
InStream: A simple abstraction for an input stream. This has a
+single public read method that reads n bytes from the stream into
+the buffer buf. Has the same semantics as a blocking read system
+call. Returns the number of bytes read or -1 if an error occurs.
+
+
OutStream: A simple abstraction for an output stream. This has a
+single write method that writes n bytes to the stream from the
+buffer buf. Has the same semantics as a blocking write system
+call. Returns the number of bytes written or -1 if an error occurs.
+
+
RecordReader: A RecordReader reads records one at a time from
+an underlying stream in a specified record format. The reader is instantiated
+with a stream and a serialization format. It has a read method that
+takes an instance of a record and deserializes the record from the stream.
+
+
RecordWriter: A RecordWriter writes records one at a
+time to an underlying stream in a specified record format. The writer is
+instantiated with a stream and a serialization format. It has a
+write method that takes an instance of a record and serializes the
+record to the stream.
+
+
Record: The base class for all generated record types. This has two
+public methods type and signature that return the typename and the
+type signature of the record.
+
+
+
+Two files are generated for each record file (note: not for each record). If a
+record file is named "name.jr", the generated files are
+"name.jr.cc" and "name.jr.hh" containing serialization
+implementations and record type declarations respectively.
+
+For each record in the DDL file, the generated header file will contain a
+class definition corresponding to the record type, method definitions for the
+generated type will be present in the '.cc' file. The generated class will
+inherit from the abstract class hadoop::Record. The DDL files
+module declaration determines the namespace the record belongs to.
+Each '.' delimited token in the module declaration results in the
+creation of a namespace. For instance, the declaration module docs.links
+results in the creation of a docs namespace and a nested
+docs::links namespace. In the preceding examples, the Link class
+is placed in the links namespace. The header file corresponding to
+the links.jr file will contain:
+
+
+namespace links {
+ class Link : public hadoop::Record {
+ // ....
+ };
+};
+
+
+Each field within the record will cause the generation of a private member
+declaration of the appropriate type in the class declaration, and one or more
+acccessor methods. The generated class will implement the serialize and
+deserialize methods defined in hadoop::Record+. It will also
+implement the inspection methods type and signature from
+hadoop::Record. A default constructor and virtual destructor will also
+be generated. Serialization code will read/write records into streams that
+implement the hadoop::InStream and the hadoop::OutStream interfaces.
+
+For each member of a record an accessor method is generated that returns
+either the member or a reference to the member. For members that are returned
+by value, a setter method is also generated. This is true for primitive
+data members of the types byte, int, long, boolean, float and
+double. For example, for a int field called MyField the folowing
+code is generated.
+
+
+
+For a ustring or buffer or composite field. The generated code
+only contains accessors that return a reference to the field. A const
+and a non-const accessor are generated. For example:
+
+
+
+Code generation for Java is similar to that for C++. A Java class is generated
+for each record type with private members corresponding to the fields. Getters
+and setters for fields are also generated. Some differences arise in the
+way comparison is expressed and in the mapping of modules to packages and
+classes to files. For equality testing, an equals method is generated
+for each record type. As per Java requirements a hashCode method is also
+generated. For comparison a compareTo method is generated for each
+record type. This has the semantics as defined by the Java Comparable
+interface, that is, the method returns a negative integer, zero, or a positive
+integer as the invoked object is less than, equal to, or greater than the
+comparison parameter.
+
+A .java file is generated per record type as opposed to per DDL
+file as in C++. The module declaration translates to a Java
+package declaration. The module name maps to an identical Java package
+name. In addition to this mapping, the DDL compiler creates the appropriate
+directory hierarchy for the package and places the generated .java
+files in the correct directories.
+
+
Mapping Summary
+
+
+DDL Type C++ Type Java Type
+
+boolean bool boolean
+byte int8_t byte
+int int32_t int
+long int64_t long
+float float float
+double double double
+ustring std::string java.lang.String
+buffer std::string org.apache.hadoop.record.Buffer
+class type class type class type
+vector std::vector java.util.ArrayList
+map std::map java.util.TreeMap
+
+
+
Data encodings
+
+This section describes the format of the data encodings supported by Hadoop.
+Currently, three data encodings are supported, namely binary, CSV and XML.
+
+
Binary Serialization Format
+
+The binary data encoding format is fairly dense. Serialization of composite
+types is simply defined as a concatenation of serializations of the constituent
+elements (lengths are included in vectors and maps).
+
+Composite types are serialized as follows:
+
+
class: Sequence of serialized members.
+
vector: The number of elements serialized as an int. Followed by a
+sequence of serialized elements.
+
map: The number of key value pairs serialized as an int. Followed
+by a sequence of serialized (key,value) pairs.
+
+
+Serialization of primitives is more interesting, with a zero compression
+optimization for integral types and normalization to UTF-8 for strings.
+Primitive types are serialized as follows:
+
+
+
byte: Represented by 1 byte, as is.
+
boolean: Represented by 1-byte (0 or 1)
+
int/long: Integers and longs are serialized zero compressed.
+Represented as 1-byte if -120 <= value < 128. Otherwise, serialized as a
+sequence of 2-5 bytes for ints, 2-9 bytes for longs. The first byte represents
+the number of trailing bytes, N, as the negative number (-120-N). For example,
+the number 1024 (0x400) is represented by the byte sequence 'x86 x04 x00'.
+This doesn't help much for 4-byte integers but does a reasonably good job with
+longs without bit twiddling.
+
float/double: Serialized in IEEE 754 single and double precision
+format in network byte order. This is the format used by Java.
+
ustring: Serialized as 4-byte zero compressed length followed by
+data encoded as UTF-8. Strings are normalized to UTF-8 regardless of native
+language representation.
+
buffer: Serialized as a 4-byte zero compressed length followed by the
+raw bytes in the buffer.
+
+
+
+
CSV Serialization Format
+
+The CSV serialization format has a lot more structure than the "standard"
+Excel CSV format, but we believe the additional structure is useful because
+
+
+
it makes parsing a lot easier without detracting too much from legibility
+
the delimiters around composites make it obvious when one is reading a
+sequence of Hadoop records
+
+
+Serialization formats for the various types are detailed in the grammar that
+follows. The notable feature of the formats is the use of delimiters for
+indicating the certain field types.
+
+
+
A string field begins with a single quote (').
+
A buffer field begins with a sharp (#).
+
A class, vector or map begins with 's{', 'v{' or 'm{' respectively and
+ends with '}'.
+
+
+The CSV format can be described by the following grammar:
+
+
+
+The XML serialization format is the same used by Apache XML-RPC
+(http://ws.apache.org/xmlrpc/types.html). This is an extension of the original
+XML-RPC format and adds some additional data types. All record I/O types are
+not directly expressible in this format, and access to a DDL is required in
+order to convert these to valid types. All types primitive or composite are
+represented by <value> elements. The particular XML-RPC type is
+indicated by a nested element in the <value> element. The encoding for
+records is always UTF-8. Primitive types are serialized as follows:
+
+
+
byte: XML tag <ex:i1>. Values: 1-byte unsigned
+integers represented in US-ASCII
+
boolean: XML tag <boolean>. Values: "0" or "1"
+
int: XML tags <i4> or <int>. Values: 4-byte
+signed integers represented in US-ASCII.
+
long: XML tag <ex:i8>. Values: 8-byte signed integers
+represented in US-ASCII.
+
float: XML tag <ex:float>. Values: Single precision
+floating point numbers represented in US-ASCII.
+
double: XML tag <double>. Values: Double precision
+floating point numbers represented in US-ASCII.
+
ustring: XML tag <;string>. Values: String values
+represented as UTF-8. XML does not permit all Unicode characters in literal
+data. In particular, NULLs and control chars are not allowed. Additionally,
+XML processors are required to replace carriage returns with line feeds and to
+replace CRLF sequences with line feeds. Programming languages that we work
+with do not impose these restrictions on string types. To work around these
+restrictions, disallowed characters and CRs are percent escaped in strings.
+The '%' character is also percent escaped.
+
buffer: XML tag <string&>. Values: Arbitrary binary
+data. Represented as hexBinary, each byte is replaced by its 2-byte
+hexadecimal representation.
+
+
+Composite types are serialized as follows:
+
+
+
class: XML tag <struct>. A struct is a sequence of
+<member> elements. Each <member> element has a <name>
+element and a <value> element. The <name> is a string that must
+match /[a-zA-Z][a-zA-Z0-9_]*/. The value of the member is represented
+by a <value> element.
+
+
vector: XML tag <array<. An <array> contains a
+single <data> element. The <data> element is a sequence of
+<value> elements each of which represents an element of the vector.
+
+
map: XML tag <array>. Same as vector.
+
+
+
+For example:
+
+
+class {
+ int MY_INT; // value 5
+ vector MY_VEC; // values 0.1, -0.89, 2.45e4
+ buffer MY_BUF; // value '\00\n\tabc%'
+}
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This task takes the given record definition files and compiles them into
+ java or c++
+ files. It is then up to the user to compile the generated files.
+
+
The task requires the file or the nested fileset element to be
+ specified. Optional attributes are language (set the output
+ language, default is "java"),
+ destdir (name of the destination directory for generated java/c++
+ code, default is ".") and failonerror (specifies error handling
+ behavior. default is true).
+
+ public class MyApp extends Configured implements Tool {
+
+ public int run(String[] args) throws Exception {
+ // Configuration processed by ToolRunner
+ Configuration conf = getConf();
+
+ // Create a JobConf using the processed conf
+ JobConf job = new JobConf(conf, MyApp.class);
+
+ // Process custom command-line options
+ Path in = new Path(args[1]);
+ Path out = new Path(args[2]);
+
+ // Specify various job-specific parameters
+ job.setJobName("my-app");
+ job.setInputPath(in);
+ job.setOutputPath(out);
+ job.setMapperClass(MyMapper.class);
+ job.setReducerClass(MyReducer.class);
+
+ // Submit the job, then poll for progress until the job is complete
+ JobClient.runJob(job);
+ return 0;
+ }
+
+ public static void main(String[] args) throws Exception {
+ // Let ToolRunner handle generic command-line options
+ int res = ToolRunner.run(new Configuration(), new MyApp(), args);
+
+ System.exit(res);
+ }
+ }
+
+
+ @see GenericOptionsParser
+ @see ToolRunner]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Tool by {@link Tool#run(String[])}, after
+ parsing with the given generic arguments. Uses the given
+ Configuration, or builds one if null.
+
+ Sets the Tool's configuration with the possibly modified
+ version of the conf.
+
+ @param conf Configuration for the Tool.
+ @param tool Tool to run.
+ @param args command-line arguments to the tool.
+ @return exit code of the {@link Tool#run(String[])} method.]]>
+
+
+
+
+
+
+
+ Tool with its Configuration.
+
+ Equivalent to run(tool.getConf(), tool, args).
+
+ @param tool Tool to run.
+ @param args command-line arguments to the tool.
+ @return exit code of the {@link Tool#run(String[])} method.]]>
+
+
+
+
+
+
+
+
+
+ ToolRunner can be used to run classes implementing
+ Tool interface. It works in conjunction with
+ {@link GenericOptionsParser} to parse the
+
+ generic hadoop command line arguments and modifies the
+ Configuration of the Tool. The
+ application-specific options are passed along without being modified.
+
+
+ @see Tool
+ @see GenericOptionsParser]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ this filter.
+ @param nbHash The number of hash function to consider.
+ @param hashType type of the hashing function (see
+ {@link org.apache.hadoop.util.hash.Hash}).]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Bloom filter, as defined by Bloom in 1970.
+
+ The Bloom filter is a data structure that was introduced in 1970 and that has been adopted by
+ the networking research community in the past decade thanks to the bandwidth efficiencies that it
+ offers for the transmission of set membership information between networked hosts. A sender encodes
+ the information into a bit vector, the Bloom filter, that is more compact than a conventional
+ representation. Computation and space costs for construction are linear in the number of elements.
+ The receiver uses the filter to test whether various elements are members of the set. Though the
+ filter will occasionally return a false positive, it will never return a false negative. When creating
+ the filter, the sender can choose its desired point in a trade-off between the false positive rate and the size.
+
+
+
+
+
+
+
+
+
+
+
+
+
+ this filter.
+ @param nbHash The number of hash function to consider.
+ @param hashType type of the hashing function (see
+ {@link org.apache.hadoop.util.hash.Hash}).]]>
+
+
+
+
+
+
+
+
+ this counting Bloom filter.
+
+ Invariant: nothing happens if the specified key does not belong to this counter Bloom filter.
+ @param key The key to remove.]]>
+
+
+
+
+
+
+
+
+
+
+
+ key -> count map.
+
NOTE: due to the bucket size of this filter, inserting the same
+ key more than 15 times will cause an overflow at all filter positions
+ associated with this key, and it will significantly increase the error
+ rate for this and other keys. For this reason the filter can only be
+ used to store small count values 0 <= N << 15.
+ @param key key to be tested
+ @return 0 if the key is not present. Otherwise, a positive value v will
+ be returned such that v == count with probability equal to the
+ error rate of this filter, and v > count otherwise.
+ Additionally, if the filter experienced an underflow as a result of
+ {@link #delete(Key)} operation, the return value may be lower than the
+ count with the probability of the false negative rate of such
+ filter.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ counting Bloom filter, as defined by Fan et al. in a ToN
+ 2000 paper.
+
+ A counting Bloom filter is an improvement to standard a Bloom filter as it
+ allows dynamic additions and deletions of set membership information. This
+ is achieved through the use of a counting vector instead of a bit vector.
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Builds an empty Dynamic Bloom filter.
+ @param vectorSize The number of bits in the vector.
+ @param nbHash The number of hash function to consider.
+ @param hashType type of the hashing function (see
+ {@link org.apache.hadoop.util.hash.Hash}).
+ @param nr The threshold for the maximum number of keys to record in a
+ dynamic Bloom filter row.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ dynamic Bloom filter, as defined in the INFOCOM 2006 paper.
+
+ A dynamic Bloom filter (DBF) makes use of a s * m bit matrix but
+ each of the s rows is a standard Bloom filter. The creation
+ process of a DBF is iterative. At the start, the DBF is a 1 * m
+ bit matrix, i.e., it is composed of a single standard Bloom filter.
+ It assumes that nr elements are recorded in the
+ initial bit vector, where nr <= n (n is
+ the cardinality of the set A to record in the filter).
+
+ As the size of A grows during the execution of the application,
+ several keys must be inserted in the DBF. When inserting a key into the DBF,
+ one must first get an active Bloom filter in the matrix. A Bloom filter is
+ active when the number of recorded keys, nr, is
+ strictly less than the current cardinality of A, n.
+ If an active Bloom filter is found, the key is inserted and
+ nr is incremented by one. On the other hand, if there
+ is no active Bloom filter, a new one is created (i.e., a new row is added to
+ the matrix) according to the current size of A and the element
+ is added in this new Bloom filter and the nr value of
+ this new Bloom filter is set to one. A given key is said to belong to the
+ DBF if the k positions are set to one in one of the matrix rows.
+
+
+
+
+
+
+
+
+ Builds a hash function that must obey to a given maximum number of returned values and a highest value.
+ @param maxValue The maximum highest returned value.
+ @param nbHash The number of resulting hashed values.
+ @param hashType type of the hashing function (see {@link Hash}).]]>
+
+
+
+
+ this hash function. A NOOP]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The idea is to randomly select a bit to reset.]]>
+
+
+
+
+
+ The idea is to select the bit to reset that will generate the minimum
+ number of false negative.]]>
+
+
+
+
+
+ The idea is to select the bit to reset that will remove the maximum number
+ of false positive.]]>
+
+
+
+
+
+ The idea is to select the bit to reset that will, at the same time, remove
+ the maximum number of false positve while minimizing the amount of false
+ negative generated.]]>
+
+
+
+
+ Originally created by
+ European Commission One-Lab Project 034819.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ this filter.
+ @param nbHash The number of hash function to consider.
+ @param hashType type of the hashing function (see
+ {@link org.apache.hadoop.util.hash.Hash}).]]>
+
+
+
+
+
+
+
+
+ this retouched Bloom filter.
+
+ Invariant: if the false positive is null, nothing happens.
+ @param key The false positive key to add.]]>
+
+
+
+
+
+ this retouched Bloom filter.
+ @param coll The collection of false positive.]]>
+
+
+
+
+
+ this retouched Bloom filter.
+ @param keys The list of false positive.]]>
+
+
+
+
+
+ this retouched Bloom filter.
+ @param keys The array of false positive.]]>
+
+
+
+
+
+
+ this retouched Bloom filter.
+ @param scheme The selective clearing scheme to apply.]]>
+
+
+
+
+
+
+
+
+
+
+
+ retouched Bloom filter, as defined in the CoNEXT 2006 paper.
+
+ It allows the removal of selected false positives at the cost of introducing
+ random false negatives, and with the benefit of eliminating some random false
+ positives at the same time.
+
+
+
+
+
+
+
+
+
diff --git a/x86_64/share/hadoop/common/jdiff/hadoop_0.17.0.xml b/x86_64/share/hadoop/common/jdiff/hadoop_0.17.0.xml
new file mode 100644
index 0000000..69dded3
--- /dev/null
+++ b/x86_64/share/hadoop/common/jdiff/hadoop_0.17.0.xml
@@ -0,0 +1,43272 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ final.
+
+ @param name resource to be added, the classpath is examined for a file
+ with that name.]]>
+
+
+
+
+
+ final.
+
+ @param url url of the resource to be added, the local filesystem is
+ examined directly to find the resource, without referring to
+ the classpath.]]>
+
+
+
+
+
+ final.
+
+ @param file file-path of resource to be added, the local filesystem is
+ examined directly to find the resource, without referring to
+ the classpath.]]>
+
+
+
+
+
+ name property, null if
+ no such property exists.
+
+ Values are processed for variable expansion
+ before being returned.
+
+ @param name the property name.
+ @return the value of the name property,
+ or null if no such property exists.]]>
+
+
+
+
+
+ name property, without doing
+ variable expansion.
+
+ @param name the property name.
+ @return the value of the name property,
+ or null if no such property exists.]]>
+
+
+
+
+
+
+ value of the name property.
+
+ @param name property name.
+ @param value property value.]]>
+
+
+
+
+
+
+ name property. If no such property
+ exists, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value, or defaultValue if the property
+ doesn't exist.]]>
+
+
+
+
+
+
+ name property as an int.
+
+ If no such property exists, or if the specified value is not a valid
+ int, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as an int,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to an int.
+
+ @param name property name.
+ @param value int value of the property.]]>
+
+
+
+
+
+
+ name property as a long.
+ If no such property is specified, or if the specified value is not a valid
+ long, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as a long,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to a long.
+
+ @param name property name.
+ @param value long value of the property.]]>
+
+
+
+
+
+
+ name property as a float.
+ If no such property is specified, or if the specified value is not a valid
+ float, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as a float,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property as a boolean.
+ If no such property is specified, or if the specified value is not a valid
+ boolean, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as a boolean,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to a boolean.
+
+ @param name property name.
+ @param value boolean value of the property.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ name property as
+ an array of Strings.
+ If no such property is specified then null is returned.
+
+ @param name property name.
+ @return property value as an array of Strings,
+ or null.]]>
+
+
+
+
+
+
+ name property as
+ an array of Strings.
+ If no such property is specified then default value is returned.
+
+ @param name property name.
+ @param defaultValue The default value
+ @return property value as an array of Strings,
+ or default value.]]>
+
+
+
+
+
+
+ name property as
+ as comma delimited values.
+
+ @param name property name.
+ @param values The values]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ name property as a Class.
+ If no such property is specified, then defaultValue is
+ returned.
+
+ @param name the class name.
+ @param defaultValue default value.
+ @return property value as a Class,
+ or defaultValue.]]>
+
+
+
+
+
+
+
+ name property as a Class
+ implementing the interface specified by xface.
+
+ If no such property is specified, then defaultValue is
+ returned.
+
+ An exception is thrown if the returned class does not implement the named
+ interface.
+
+ @param name the class name.
+ @param defaultValue default value.
+ @param xface the interface implemented by the named class.
+ @return property value as a Class,
+ or defaultValue.]]>
+
+
+
+
+
+
+
+ name property to the name of a
+ theClass implementing the given interface xface.
+
+ An exception is thrown if theClass does not implement the
+ interface xface.
+
+ @param name property name.
+ @param theClass property value.
+ @param xface the interface implemented by the named class.]]>
+
+
+
+
+
+
+
+ dirsProp with
+ the given path. If dirsProp contains multiple directories,
+ then one is chosen based on path's hash code. If the selected
+ directory does not exist, an attempt is made to create it.
+
+ @param dirsProp directory in which to locate the file.
+ @param path file-path.
+ @return local file under the directory with the given path.]]>
+
+
+
+
+
+
+
+ dirsProp with
+ the given path. If dirsProp contains multiple directories,
+ then one is chosen based on path's hash code. If the selected
+ directory does not exist, an attempt is made to create it.
+
+ @param dirsProp directory in which to locate the file.
+ @param path file-path.
+ @return local file under the directory with the given path.]]>
+
+
+
+
+
+
+
+
+
+
+
+ name.
+
+ @param name configuration resource name.
+ @return an input stream attached to the resource.]]>
+
+
+
+
+
+ name.
+
+ @param name configuration resource name.
+ @return a reader attached to the resource.]]>
+
+
+
+
+ String
+ key-value pairs in the configuration.
+
+ @return an iterator over the entries.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true to set quiet-mode on, false
+ to turn it off.]]>
+
+
+
+
+
+
+
+
+
+
+ Resources
+
+
Configurations are specified by resources. A resource contains a set of
+ name/value pairs as XML data. Each resource is named by either a
+ String or by a {@link Path}. If named by a String,
+ then the classpath is examined for a file with that name. If named by a
+ Path, then the local filesystem is examined directly, without
+ referring to the classpath.
+
+
Hadoop by default specifies two resources, loaded in-order from the
+ classpath:
hadoop-site.xml: Site-specific configuration for a given hadoop
+ installation.
+
+ Applications may add additional resources, which are loaded
+ subsequent to these resources in the order they are added.
+
+
Final Parameters
+
+
Configuration parameters may be declared final.
+ Once a resource declares a value final, no subsequently-loaded
+ resource can alter that value.
+ For example, one might define a final parameter with:
+
+
+ When conf.get("tempdir") is called, then ${basedir}
+ will be resolved to another property in this Configuration, while
+ ${user.name} would then ordinarily be resolved to the value
+ of the System property with that name.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The balancer is a tool that balances disk space usage on an HDFS cluster
+ when some datanodes become full or when new empty nodes join the cluster.
+ The tool is deployed as an application program that can be run by the
+ cluster administrator on a live HDFS cluster while applications
+ adding and deleting files.
+
+
SYNOPSIS
+
+ To start:
+ bin/start-balancer.sh [-threshold ]
+ Example: bin/ start-balancer.sh
+ start the balancer with a default threshold of 10%
+ bin/ start-balancer.sh -threshold 5
+ start the balancer with a threshold of 5%
+ To stop:
+ bin/ stop-balancer.sh
+
+
+
DESCRIPTION
+
The threshold parameter is a fraction in the range of (0%, 100%) with a
+ default value of 10%. The threshold sets a target for whether the cluster
+ is balanced. A cluster is balanced if for each datanode, the utilization
+ of the node (ratio of used space at the node to total capacity of the node)
+ differs from the utilization of the (ratio of used space in the cluster
+ to total capacity of the cluster) by no more than the threshold value.
+ The smaller the threshold, the more balanced a cluster will become.
+ It takes more time to run the balancer for small threshold values.
+ Also for a very small threshold the cluster may not be able to reach the
+ balanced state when applications write and delete files concurrently.
+
+
The tool moves blocks from highly utilized datanodes to poorly
+ utilized datanodes iteratively. In each iteration a datanode moves or
+ receives no more than the lesser of 10G bytes or the threshold fraction
+ of its capacity. Each iteration runs no more than 20 minutes.
+ At the end of each iteration, the balancer obtains updated datanodes
+ information from the namenode.
+
+
A system property that limits the balancer's use of bandwidth is
+ defined in the default configuration file:
+
+
+ dfs.balance.bandwidthPerSec
+ 1048576
+ Specifies the maximum bandwidth that each datanode
+ can utilize for the balancing purpose in term of the number of bytes
+ per second.
+
+
+
+
This property determines the maximum speed at which a block will be
+ moved from one datanode to another. The default value is 1MB/s. The higher
+ the bandwidth, the faster a cluster can reach the balanced state,
+ but with greater competition with application processes. If an
+ administrator changes the value of this property in the configuration
+ file, the change is observed when HDFS is next restarted.
+
+
MONITERING BALANCER PROGRESS
+
After the balancer is started, an output file name where the balancer
+ progress will be recorded is printed on the screen. The administrator
+ can monitor the running of the balancer by reading the output file.
+ The output shows the balancer's status iteration by iteration. In each
+ iteration it prints the starting time, the iteration number, the total
+ number of bytes that have been moved in the previous iterations,
+ the total number of bytes that are left to move in order for the cluster
+ to be balanced, and the number of bytes that are being moved in this
+ iteration. Normally "Bytes Already Moved" is increasing while "Bytes Left
+ To Move" is decreasing.
+
+
Running multiple instances of the balancer in an HDFS cluster is
+ prohibited by the tool.
+
+
The balancer automatically exits when any of the following five
+ conditions is satisfied:
+
+
The cluster is balanced;
+
No block can be moved;
+
No block has been moved for five consecutive iterations;
+
An IOException occurs while communicating with the namenode;
+
Another balancer is running.
+
+
+
Upon exit, a balancer returns an exit code and prints one of the
+ following messages to the output file in corresponding to the above exit
+ reasons:
+
+
The cluster is balanced. Exiting
+
No block can be moved. Exiting...
+
No block has been moved for 3 iterations. Exiting...
+
Received an IO exception: failure reason. Exiting...
+
Another balancer is running. Exiting...
+
+
+
The administrator can interrupt the execution of the balancer at any
+ time by running the command "stop-balancer.sh" on the machine where the
+ balancer is running.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ in]]>
+
+
+
+
+
+
+ out.]]>
+
+
+
+
+
+
+
+
+
+ reset is true, then resets the checksum.
+ @return number of bytes written. Will be equal to getChecksumSize();]]>
+
+
+
+
+
+
+
+
+ reset is true, then resets the checksum.
+ @return number of bytes written. Will be equal to getChecksumSize();]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ stream of bytes (of BLOCK_SIZE or less)
+
+ This info is stored on a local disk. The DataNode
+ reports the table's contents to the NameNode upon startup
+ and every so often afterwards.
+
+ DataNodes spend their lives in an endless loop of asking
+ the NameNode for something to do. A NameNode cannot connect
+ to a DataNode directly; a NameNode simply returns values from
+ functions invoked by a DataNode.
+
+ DataNodes maintain an open server socket so that client code
+ or other DataNodes can read/write data. The host/port for
+ this server is reported to the NameNode, which then sends that
+ information to clients or other DataNodes that might be interested.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The tool scans all files and directories, starting from an indicated
+ root path. The following abnormal conditions are detected and handled:
+
+
files with blocks that are completely missing from all datanodes.
+ In this case the tool can perform one of the following actions:
+
+
none ({@link NamenodeFsck#FIXING_NONE})
+
move corrupted files to /lost+found directory on DFS
+ ({@link NamenodeFsck#FIXING_MOVE}). Remaining data blocks are saved as a
+ block chains, representing longest consecutive series of valid blocks.
{@link FSConstants.StartupOption#REGULAR REGULAR} - normal startup
+
{@link FSConstants.StartupOption#FORMAT FORMAT} - format name node
+
{@link FSConstants.StartupOption#UPGRADE UPGRADE} - start the cluster
+ upgrade and create a snapshot of the current file system state
+
{@link FSConstants.StartupOption#ROLLBACK ROLLBACK} - roll the
+ cluster back to the previous state
+
+ The option is passed via configuration field:
+ dfs.namenode.startup
+
+ The conf will be modified to reflect the actual ports on which
+ the NameNode is up and running if the user passes the port as
+ zero in the conf.
+
+ @param conf confirguration
+ @throws IOException]]>
+
+
+
+
+
+ zero.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ datanode whose
+ total size is size
+
+ @param datanode on which blocks are located
+ @param size total size of blocks]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ blocksequence (namespace)
+ 2) block->machinelist ("inodes")
+
+ The first table is stored on disk and is very precious.
+ The second table is rebuilt every time the NameNode comes
+ up.
+
+ 'NameNode' refers to both this class as well as the 'NameNode server'.
+ The 'FSNamesystem' class actually performs most of the filesystem
+ management. The majority of the 'NameNode' class itself is concerned
+ with exposing the IPC interface to the outside world, plus some
+ configuration management.
+
+ NameNode implements the ClientProtocol interface, which allows
+ clients to ask for DFS services. ClientProtocol is not
+ designed for direct use by authors of DFS client code. End-users
+ should instead use the org.apache.nutch.hadoop.fs.FileSystem class.
+
+ NameNode also implements the DatanodeProtocol interface, used by
+ DataNode programs that actually store DFS data blocks. These
+ methods are invoked repeatedly and automatically by all the
+ DataNodes in a DFS deployment.
+
+ NameNode also implements the NamenodeProtocol interface, used by
+ secondary namenodes or rebalancing processes to get partial namenode's
+ state, for example partial blocksMap etc.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The tool scans all files and directories, starting from an indicated
+ root path. The following abnormal conditions are detected and handled:
+
+
files with blocks that are completely missing from all datanodes.
+ In this case the tool can perform one of the following actions:
+
+
none ({@link #FIXING_NONE})
+
move corrupted files to /lost+found directory on DFS
+ ({@link #FIXING_MOVE}). Remaining data blocks are saved as a
+ block chains, representing longest consecutive series of valid blocks.
+
delete corrupted files ({@link #FIXING_DELETE})
+
+
+
detect files with under-replicated or over-replicated blocks
+
+ Additionally, the tool collects a detailed overall DFS statistics, and
+ optionally can print detailed statistics on block locations and replication
+ factors of each file.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class has a number of metrics variables that are publicly accessible;
+ these variables (objects) have methods to update their values;
+ for example:
+
The most important difference is that unlike GFS, Hadoop DFS files
+have strictly one writer at any one time. Bytes are always appended
+to the end of the writer's stream. There is no notion of "record appends"
+or "mutations" that are then checked or reordered. Writers simply emit
+a byte stream. That byte stream is guaranteed to be stored in the
+order written.
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class has a number of metrics variables that are publicly accessible;
+ these variables (objects) have methods to update their values;
+ for example:
+
Applications specify the files, via urls (hdfs:// or http://) to be cached
+ via the {@link JobConf}. The DistributedCache assumes that the
+ files specified via hdfs:// urls are already present on the
+ {@link FileSystem} at the path specified by the url.
+
+
The framework will copy the necessary files on to the slave node before
+ any tasks for the job are executed on that node. Its efficiency stems from
+ the fact that the files are only copied once per job and the ability to
+ cache archives which are un-archived on the slaves.
+
+
DistributedCache can be used to distribute simple, read-only
+ data/text files and/or more complex types such as archives, jars etc.
+ Archives (zip files) are un-archived at the slave nodes. Jars maybe be
+ optionally added to the classpath of the tasks, a rudimentary software
+ distribution mechanism. Files have execution permissions. Optionally users
+ can also direct it to symlink the distributed cache file(s) into
+ the working directory of the task.
+
+
DistributedCache tracks modification timestamps of the cache
+ files. Clearly the cache files should not be modified by the application
+ or externally while the job is executing.
+
+
Here is an illustrative example on how to use the
+ DistributedCache:
+
+ // Setting up the cache for the application
+
+ 1. Copy the requisite files to the FileSystem:
+
+ $ bin/hadoop fs -copyFromLocal lookup.dat /myapp/lookup.dat
+ $ bin/hadoop fs -copyFromLocal map.zip /myapp/map.zip
+ $ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar
+
+ 2. Setup the application's JobConf:
+
+ JobConf job = new JobConf();
+ DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"),
+ job);
+ DistributedCache.addCacheArchive(new URI("/myapp/map.zip", job);
+ DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
+
+ 3. Use the cached files in the {@link Mapper} or {@link Reducer}:
+
+ public static class MapClass extends MapReduceBase
+ implements Mapper<K, V, K, V> {
+
+ private Path[] localArchives;
+ private Path[] localFiles;
+
+ public void configure(JobConf job) {
+ // Get the cached archives/files
+ localArchives = DistributedCache.getLocalCacheArchives(job);
+ localFiles = DistributedCache.getLocalCacheFiles(job);
+ }
+
+ public void map(K key, V value,
+ OutputCollector<K, V> output, Reporter reporter)
+ throws IOException {
+ // Use data from the cached archives/files here
+ // ...
+ // ...
+ output.collect(k, v);
+ }
+ }
+
+
+ A filename pattern is composed of regular characters and
+ special pattern matching characters, which are:
+
+
+
+
+
+
?
+
Matches any single character.
+
+
+
*
+
Matches zero or more characters.
+
+
+
[abc]
+
Matches a single character from character set
+ {a,b,c}.
+
+
+
[a-b]
+
Matches a single character from the character range
+ {a...b}. Note that character a must be
+ lexicographically less than or equal to character b.
+
+
+
[^a]
+
Matches a single character that is not from character set or range
+ {a}. Note that the ^ character must occur
+ immediately to the right of the opening bracket.
+
+
+
\c
+
Removes (escapes) any special meaning of character c.
+
+
+
{ab,cd}
+
Matches a string from the string set {ab, cd}
+
+
+
{ab,c{de,fh}}
+
Matches a string from the string set {ab, cde, cfh}
+
+
+
+
+
+ @param pathPattern a regular expression specifying a pth pattern
+
+ @return an array of paths that match the path pattern
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ All user code that may potentially use the Hadoop Distributed
+ File System should be written to use a FileSystem object. The
+ Hadoop DFS is a multi-machine system that appears as a single
+ disk. It's useful because of its fault tolerance and potentially
+ very large capacity.
+
+
+ The local implementation is {@link LocalFileSystem} and distributed
+ implementation is {@link DistributedFileSystem}.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FilterFileSystem contains
+ some other file system, which it uses as
+ its basic file system, possibly transforming
+ the data along the way or providing additional
+ functionality. The class FilterFileSystem
+ itself simply overrides all methods of
+ FileSystem with versions that
+ pass all requests to the contained file
+ system. Subclasses of FilterFileSystem
+ may further override some of these methods
+ and may also provide additional methods
+ and fields.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ buf at offset
+ and checksum into checksum.
+ The method is used for implementing read, therefore, it should be optimized
+ for sequential reading
+ @param pos chunkPos
+ @param buf desitination buffer
+ @param offset offset in buf at which to store data
+ @param len maximun number of bytes to read
+ @return number of bytes read]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ -1 if the end of the
+ stream is reached.
+ @exception IOException if an I/O error occurs.]]>
+
+
+
+
+
+
+
+
+ This method implements the general contract of the corresponding
+ {@link InputStream#read(byte[], int, int) read} method of
+ the {@link InputStream} class. As an additional
+ convenience, it attempts to read as many bytes as possible by repeatedly
+ invoking the read method of the underlying stream. This
+ iterated read continues until one of the following
+ conditions becomes true:
+
+
The specified number of bytes have been read,
+
+
The read method of the underlying stream returns
+ -1, indicating end-of-file.
+
+
If the first read on the underlying stream returns
+ -1 to indicate end-of-file then this method returns
+ -1. Otherwise this method returns the number of bytes
+ actually read.
+
+ @param b destination buffer.
+ @param off offset at which to start storing bytes.
+ @param len maximum number of bytes to read.
+ @return the number of bytes read, or -1 if the end of
+ the stream has been reached.
+ @exception IOException if an I/O error occurs.
+ ChecksumException if any checksum error occurs]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ n bytes of data from the
+ input stream.
+
+
This method may skip more bytes than are remaining in the backing
+ file. This produces no exception and the number of bytes skipped
+ may include some number of bytes that were beyond the EOF of the
+ backing file. Attempting to read from the stream after skipping past
+ the end will result in -1 indicating the end of the file.
+
+
If n is negative, no bytes are skipped.
+
+ @param n the number of bytes to be skipped.
+ @return the actual number of bytes skipped.
+ @exception IOException if an I/O error occurs.
+ ChecksumException if the chunk to skip to is corrupted]]>
+
+
+
+
+
+
+ This method may seek past the end of the file.
+ This produces no exception and an attempt to read from
+ the stream will result in -1 indicating the end of the file.
+
+ @param pos the postion to seek to.
+ @exception IOException if an I/O error occurs.
+ ChecksumException if the chunk to seek to is corrupted]]>
+
+
+
+
+
+
+
+
+
+ len bytes from
+ stm
+
+ @param stm an input stream
+ @param buf destiniation buffer
+ @param offset offset at which to store data
+ @param len number of bytes to read
+ @return actual number of bytes read
+ @throws IOException if there is any IO error]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ len bytes from the specified byte array
+ starting at offset off and generate a checksum for
+ each data chunk.
+
+
This method stores bytes from the given array into this
+ stream's buffer before it gets checksumed. The buffer gets checksumed
+ and flushed to the underlying output stream when all data
+ in a checksum chunk are in the buffer. If the buffer is empty and
+ requested length is at least as large as the size of next checksum chunk
+ size, this method will checksum and write the chunk directly
+ to the underlying output stream. Thus it avoids uneccessary data copy.
+
+ @param b the data.
+ @param off the start offset in the data.
+ @param len the number of bytes to write.
+ @exception IOException if an I/O error occurs.]]>
+
+
+This pages describes how to use Kosmos Filesystem
+( KFS ) as a backing
+store with Hadoop. This page assumes that you have downloaded the
+KFS software and installed necessary binaries as outlined in the KFS
+documentation.
+
+
Steps
+
+
+
In the Hadoop conf directory edit hadoop-default.xml,
+ add the following:
+
In the Hadoop conf directory edit hadoop-site.xml,
+ adding the following (with appropriate values for
+ <server> and <port>):
+
+<property>
+ <name>fs.default.name</name>
+ <value>kfs://<server:port></value>
+</property>
+
+<property>
+ <name>fs.kfs.metaServerHost</name>
+ <value><server></value>
+ <description>The location of the KFS meta server.</description>
+</property>
+
+<property>
+ <name>fs.kfs.metaServerPort</name>
+ <value><port></value>
+ <description>The location of the meta server's port.</description>
+</property>
+
+
+
+
+
Copy KFS's kfs-0.1.jar to Hadoop's lib directory. This step
+ enables Hadoop's to load the KFS specific modules. Note
+ that, kfs-0.1.jar was built when you compiled KFS source
+ code. This jar file contains code that calls KFS's client
+ library code via JNI; the native code is in KFS's
+ libkfsClient.so library.
+
+
+
When the Hadoop map/reduce trackers start up, those
+processes (on local as well as remote nodes) will now need to load
+KFS's libkfsClient.so library. To simplify this process, it is advisable to
+store libkfsClient.so in an NFS accessible directory (similar to where
+Hadoop binaries/scripts are stored); then, modify Hadoop's
+conf/hadoop-env.sh adding the following line and providing suitable
+value for <path>:
+
+export LD_LIBRARY_PATH=<path>
+
+
+
+
Start only the map/reduce trackers
+
+ example: execute Hadoop's bin/start-mapred.sh
+Files are stored in S3 as blocks (represented by
+{@link org.apache.hadoop.fs.s3.Block}), which have an ID and a length.
+Block metadata is stored in S3 as a small record (represented by
+{@link org.apache.hadoop.fs.s3.INode}) using the URL-encoded
+path string as a key. Inodes record the file type (regular file or directory) and the list of blocks.
+This design makes it easy to seek to any given position in a file by reading the inode data to compute
+which block to access, then using S3's support for
+HTTP Range headers
+to start streaming from the correct position.
+Renames are also efficient since only the inode is moved (by a DELETE followed by a PUT since
+S3 does not support renames).
+
+
+For a single file /dir1/file1 which takes two blocks of storage, the file structure in S3
+would be something like this:
+
+
+ DataInputBuffer buffer = new DataInputBuffer();
+ while (... loop condition ...) {
+ byte[] data = ... get data ...;
+ int dataLength = ... get data length ...;
+ buffer.reset(data, dataLength);
+ ... read buffer using DataInput methods ...
+ }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This saves memory over creating a new DataOutputStream and
+ ByteArrayOutputStream each time data is written.
+
+
Typical usage is something like the following:
+
+ DataOutputBuffer buffer = new DataOutputBuffer();
+ while (... loop condition ...) {
+ buffer.reset();
+ ... write buffer using DataOutput methods ...
+ byte[] data = buffer.getData();
+ int dataLength = buffer.getLength();
+ ... write data to its ultimate destination ...
+ }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to store
+ @param item the object to be stored
+ @param keyName the name of the key to use
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param keyName the name of the key to use
+ @param itemClass the class of the item
+ @return restored object
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param items the objects to be stored
+ @param keyName the name of the key to use
+ @throws IndexOutOfBoundsException if the items array is empty
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param keyName the name of the key to use
+ @param itemClass the class of the item
+ @return restored object
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+ DefaultStringifier offers convenience methods to store/load objects to/from
+ the configuration.
+
+ @param the class of the objects to stringify]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a FloatWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ When two sequence files, which have same Key type but different Value
+ types, are mapped out to reduce, multiple Value types is not allowed.
+ In this case, this class can help you wrap instances with different types.
+
+
+
+ Compared with ObjectWritable, this class is much more effective,
+ because ObjectWritable will append the class declaration as a String
+ into the output file in every Key-Value pair.
+
+
+
+ Generic Writable implements {@link Configurable} interface, so that it will be
+ configured by the framework. The configuration is passed to the wrapped objects
+ implementing {@link Configurable} interface before deserialization.
+
+
+ how to use it:
+ 1. Write your own class, such as GenericObject, which extends GenericWritable.
+ 2. Implements the abstract method getTypes(), defines
+ the classes which will be wrapped in GenericObject in application.
+ Attention: this classes defined in getTypes() method, must
+ implement Writable interface.
+
+
+ @since Nov 8, 2006]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This saves memory over creating a new InputStream and
+ ByteArrayInputStream each time data is read.
+
+
Typical usage is something like the following:
+
+ InputBuffer buffer = new InputBuffer();
+ while (... loop condition ...) {
+ byte[] data = ... get data ...;
+ int dataLength = ... get data length ...;
+ buffer.reset(data, dataLength);
+ ... read buffer using InputStream methods ...
+ }
+
+ @see DataInputBuffer
+ @see DataOutput]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a IntWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ closes the input and output streams
+ at the end.
+ @param in InputStrem to read from
+ @param out OutputStream to write to
+ @param conf the Configuration object]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a LongWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A map is a directory containing two files, the data file,
+ containing all keys and values in the map, and a smaller index
+ file, containing a fraction of the keys. The fraction is determined by
+ {@link Writer#getIndexInterval()}.
+
+
The index file is read entirely into memory. Thus key implementations
+ should try to keep themselves small.
+
+
Map files are created by adding entries in-order. To maintain a large
+ database, perform updates by copying the previous version of a database and
+ merging in a sorted change list, to create a new version of the database in
+ a new file. Sorting large change lists can be done with {@link
+ SequenceFile.Sorter}.]]>
+
SequenceFile provides {@link Writer}, {@link Reader} and
+ {@link Sorter} classes for writing, reading and sorting respectively.
+
+ There are three SequenceFileWriters based on the
+ {@link CompressionType} used to compress key/value pairs:
+
+
+ Writer : Uncompressed records.
+
+
+ RecordCompressWriter : Record-compressed files, only compress
+ values.
+
+
+ BlockCompressWriter : Block-compressed files, both keys &
+ values are collected in 'blocks'
+ separately and compressed. The size of
+ the 'block' is configurable.
+
+
+
The actual compression algorithm used to compress key and/or values can be
+ specified by using the appropriate {@link CompressionCodec}.
+
+
The recommended way is to use the static createWriter methods
+ provided by the SequenceFile to chose the preferred format.
+
+
The {@link Reader} acts as the bridge and can read any of the above
+ SequenceFile formats.
+
+
SequenceFile Formats
+
+
Essentially there are 3 different formats for SequenceFiles
+ depending on the CompressionType specified. All of them share a
+ common header described below.
+
+
SequenceFile Header
+
+
+ version - 3 bytes of magic header SEQ, followed by 1 byte of actual
+ version number (e.g. SEQ4 or SEQ6)
+
+
+ keyClassName -key class
+
+
+ valueClassName - value class
+
+
+ compression - A boolean which specifies if compression is turned on for
+ keys/values in this file.
+
+
+ blockCompression - A boolean which specifies if block-compression is
+ turned on for keys/values in this file.
+
+
+ compression codec - CompressionCodec class which is used for
+ compression of keys and/or values (if compression is
+ enabled).
+
+
+ metadata - {@link Metadata} for this file.
+
+
+ sync - A sync marker to denote end of the header.
+
The compressed blocks of key lengths and value lengths consist of the
+ actual lengths of individual keys/values encoded in ZeroCompressedInteger
+ format.
+
+ @see CompressionCodec]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ key, skipping its
+ value. True if another entry exists, and false at end of file.]]>
+
+
+
+
+
+
+
+ key and
+ val. Returns true if such a pair exists and false when at
+ end of file]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The position passed must be a position returned by {@link
+ SequenceFile.Writer#getLength()} when writing this file. To seek to an arbitrary
+ position, use {@link SequenceFile.Reader#sync(long)}.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ SegmentDescriptor
+ @param segments the list of SegmentDescriptors
+ @param tmpDir the directory to write temporary files into
+ @return RawKeyValueIterator
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ For best performance, applications should make sure that the {@link
+ Writable#readFields(DataInput)} implementation of their keys is
+ very efficient. In particular, it should avoid allocating memory.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This always returns a synchronized position. In other words,
+ immediately after calling {@link SequenceFile.Reader#seek(long)} with a position
+ returned by this method, {@link SequenceFile.Reader#next(Writable)} may be called. However
+ the key may be earlier in the file than key last written when this
+ method was called (e.g., with block-compression, it may be the first key
+ in the block that was being written when this method was called).]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ key. Returns
+ true if such a key exists and false when at the end of the set.]]>
+
+
+
+
+
+
+ key.
+ Returns key, or null if no match exists.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the class of the objects to stringify]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ position. Note that this
+ method avoids using the converter or doing String instatiation
+ @return the Unicode scalar value at position or -1
+ if the position is invalid or points to a
+ trailing byte]]>
+
+
+
+
+
+
+
+
+
+ what in the backing
+ buffer, starting as position start. The starting
+ position is measured in bytes and the return value is in
+ terms of byte position in the buffer. The backing buffer is
+ not converted to a string for this operation.
+ @return byte position of the first occurence of the search
+ string in the UTF-8 buffer or -1 if not found]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a Text with the same contents.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ replace is true, then
+ malformed input is replaced with the
+ substitution character, which is U+FFFD. Otherwise the
+ method throws a MalformedInputException.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ replace is true, then
+ malformed input is replaced with the
+ substitution character, which is U+FFFD. Otherwise the
+ method throws a MalformedInputException.
+ @return ByteBuffer: bytes stores at ByteBuffer.array()
+ and length is ByteBuffer.limit()]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ In
+ addition, it provides methods for string traversal without converting the
+ byte array to a string.
Also includes utilities for
+ serializing/deserialing a string, coding/decoding a string, checking if a
+ byte array contains valid UTF8 code, calculating the length of an encoded
+ string.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a UTF8 with the same contents.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Also includes utilities for efficiently reading and writing UTF-8.
+
+ @deprecated replaced by Text]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This is useful when a class may evolve, so that instances written by the
+ old version of the class may still be processed by the new version. To
+ handle this situation, {@link #readFields(DataInput)}
+ implementations should catch {@link VersionMismatchException}.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a VIntWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a VLongWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ out.
+
+ @param out DataOuput to serialize this object into.
+ @throws IOException]]>
+
+
+
+
+
+
+ in.
+
+
For efficiency, implementations should attempt to re-use storage in the
+ existing object where possible.
+
+ @param in DataInput to deseriablize this object from.
+ @throws IOException]]>
+
+
+
+ Any key or value type in the Hadoop Map-Reduce
+ framework implements this interface.
+
+
Implementations typically implement a static read(DataInput)
+ method which constructs a new instance, calls {@link #readFields(DataInput)}
+ and returns the instance.
+
+
Example:
+
+ public class MyWritable implements Writable {
+ // Some data
+ private int counter;
+ private long timestamp;
+
+ public void write(DataOutput out) throws IOException {
+ out.writeInt(counter);
+ out.writeLong(timestamp);
+ }
+
+ public void readFields(DataInput in) throws IOException {
+ counter = in.readInt();
+ timestamp = in.readLong();
+ }
+
+ public static MyWritable read(DataInput in) throws IOException {
+ MyWritable w = new MyWritable();
+ w.readFields(in);
+ return w;
+ }
+ }
+
]]>
+
+
+
+
+
+
+
+
+ WritableComparables can be compared to each other, typically
+ via Comparators. Any type which is to be used as a
+ key in the Hadoop Map-Reduce framework should implement this
+ interface.
+
+
Example:
+
+ public class MyWritableComparable implements WritableComparable {
+ // Some data
+ private int counter;
+ private long timestamp;
+
+ public void write(DataOutput out) throws IOException {
+ out.writeInt(counter);
+ out.writeLong(timestamp);
+ }
+
+ public void readFields(DataInput in) throws IOException {
+ counter = in.readInt();
+ timestamp = in.readLong();
+ }
+
+ public int compareTo(MyWritableComparable w) {
+ int thisValue = this.value;
+ int thatValue = ((IntWritable)o).value;
+ return (thisValue < thatValue ? -1 : (thisValue==thatValue ? 0 : 1));
+ }
+ }
+
One may optimize compare-intensive operations by overriding
+ {@link #compare(byte[],int,int,byte[],int,int)}. Static utility methods are
+ provided to assist in optimized implementations of this method.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Enum type
+ @param in DataInput to read from
+ @param enumType Class type of Enum
+ @return Enum represented by String read from DataInput
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ len number of bytes in input streamin
+ @param in input stream
+ @param len number of bytes to skip
+ @throws IOException when skipped less number of bytes]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Implementations are assumed to be buffered. This permits clients to
+ reposition the underlying input stream then call {@link #resetState()},
+ without having to also synchronize client buffers.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true indicating that more input data is required.
+
+ @param b Input data
+ @param off Start offset
+ @param len Length]]>
+
+
+
+
+ true if the input data buffer is empty and
+ #setInput() should be called in order to provide more input.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the end of the compressed
+ data output stream has been reached.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true indicating that more input data is required.
+
+ @param b Input data
+ @param off Start offset
+ @param len Length]]>
+
+
+
+
+ true if the input data buffer is empty and
+ #setInput() should be called in order to provide more input.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ true if a preset dictionary is needed for decompression.
+ @return true if a preset dictionary is needed for decompression]]>
+
+
+
+
+ true if the end of the compressed
+ data output stream has been reached.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if native-lzo library is loaded & initialized;
+ else false]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ lzo compression/decompression pair.
+ http://www.oberhumer.com/opensource/lzo/]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if lzo compressors are loaded & initialized,
+ else false]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if lzo decompressors are loaded & initialized,
+ else false]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ @return the total (non-negative) number of uncompressed bytes input so far]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ @return the total (non-negative) number of uncompressed bytes input so far]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if native-zlib is loaded & initialized
+ and can be loaded for this job, else false]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Keep trying a limited number of times, waiting a fixed time between attempts,
+ and then fail by re-throwing the exception.
+ ]]>
+
+
+
+
+
+
+
+
+ Keep trying for a maximum time, waiting a fixed time between attempts,
+ and then fail by re-throwing the exception.
+ ]]>
+
+
+
+
+
+
+
+
+ Keep trying a limited number of times, waiting a growing amount of time between attempts,
+ and then fail by re-throwing the exception.
+ The time between attempts is sleepTime mutliplied by the number of tries so far.
+ ]]>
+
+
+
+
+
+
+
+
+ Keep trying a limited number of times, waiting a growing amount of time between attempts,
+ and then fail by re-throwing the exception.
+ The time between attempts is sleepTime mutliplied by a random
+ number in the range of [0, 2 to the number of retries)
+ ]]>
+
+
+
+
+
+
+
+ Set a default policy with some explicit handlers for specific exceptions.
+ ]]>
+
+
+
+
+
+
+
+ A retry policy for RemoteException
+ Set a default policy with some explicit handlers for specific exceptions.
+ ]]>
+
+
+
+
+
+ Try once, and fail by re-throwing the exception.
+ This corresponds to having no retry mechanism in place.
+ ]]>
+
+
+
+
+
+ Try once, and fail silently for void methods, or by
+ re-throwing the exception for non-void methods.
+ ]]>
+
+
+
+
+
+ Keep trying forever.
+ ]]>
+
+
+
+
+ A collection of useful implementations of {@link RetryPolicy}.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+ Determines whether the framework should retry a
+ method for the given exception, and the number
+ of retries that have been made for that operation
+ so far.
+
+ @param e The exception that caused the method to fail.
+ @param retries The number of times the method has been retried.
+ @return true if the method should be retried,
+ false if the method should not be retried
+ but shouldn't fail with an exception (only for void methods).
+ @throws Exception The re-thrown exception e indicating
+ that the method failed and should not be retried further.]]>
+
+
+
+
+ Specifies a policy for retrying method failures.
+ Implementations of this interface should be immutable.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Create a proxy for an interface of an implementation class
+ using the same retry policy for each method in the interface.
+
+ @param iface the interface that the retry will implement
+ @param implementation the instance whose methods should be retried
+ @param retryPolicy the policy for retirying method call failures
+ @return the retry proxy]]>
+
+
+
+
+
+
+
+
+ Create a proxy for an interface of an implementation class
+ using the a set of retry policies specified by method name.
+ If no retry policy is defined for a method then a default of
+ {@link RetryPolicies#TRY_ONCE_THEN_FAIL} is used.
+
+ @param iface the interface that the retry will implement
+ @param implementation the instance whose methods should be retried
+ @param methodNameToPolicyMap a map of method names to retry policies
+ @return the retry proxy]]>
+
+
+
+
+ A factory for creating retry proxies.
+ ]]>
+
+
+
+
+
+A mechanism for selectively retrying methods that throw exceptions under certain circumstances.
+
+
+
+This will retry any method called on unreliable four times - in this case the call()
+method - sleeping 10 seconds between
+each retry. There are a number of {@link org.apache.hadoop.io.retry.RetryPolicies retry policies}
+available, or you can implement a custom one by implementing {@link org.apache.hadoop.io.retry.RetryPolicy}.
+It is also possible to specify retry policies on a
+{@link org.apache.hadoop.io.retry.RetryProxy#create(Class, Object, Map) per-method basis}.
+
]]>
+
+
+
+
+
+
+
+
+
+ Prepare the deserializer for reading.]]>
+
+
+
+
+
+
+
+ Deserialize the next object from the underlying input stream.
+ If the object t is non-null then this deserializer
+ may set its internal state to the next object read from the input
+ stream. Otherwise, if the object t is null a new
+ deserialized object will be created.
+
+ @return the deserialized object]]>
+
+
+
+
+
+ Close the underlying input stream and clear up any resources.]]>
+
+
+
+
+ Provides a facility for deserializing objects of type from an
+ {@link InputStream}.
+
+
+
+ Deserializers are stateful, but must not buffer the input since
+ other producers may read from the input between calls to
+ {@link #deserialize(Object)}.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A {@link RawComparator} that uses a {@link Deserializer} to deserialize
+ the objects to be compared so that the standard {@link Comparator} can
+ be used to compare them.
+
+
+ One may optimize compare-intensive operations by using a custom
+ implementation of {@link RawComparator} that operates directly
+ on byte representations.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ An experimental {@link Serialization} for Java {@link Serializable} classes.
+
+ @see JavaSerializationComparator]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A {@link RawComparator} that uses a {@link JavaSerialization}
+ {@link Deserializer} to deserialize objects that are then compared via
+ their {@link Comparable} interfaces.
+
+ @param
+ @see JavaSerialization]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Encapsulates a {@link Serializer}/{@link Deserializer} pair.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+ Serializations are found by reading the io.serializations
+ property from conf, which is a comma-delimited list of
+ classnames.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A factory for {@link Serialization}s.
+ ]]>
+
+
+
+
+
+
+
+
+
+ Prepare the serializer for writing.]]>
+
+
+
+
+
+
+ Serialize t to the underlying output stream.]]>
+
+
+
+
+
+ Close the underlying output stream and clear up any resources.]]>
+
+
+
+
+ Provides a facility for serializing objects of type to an
+ {@link OutputStream}.
+
+
+
+ Serializers are stateful, but must not buffer the output since
+ other producers may write to the output between calls to
+ {@link #serialize(Object)}.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+This package provides a mechanism for using different serialization frameworks
+in Hadoop. The property "io.serializations" defines a list of
+{@link org.apache.hadoop.io.serializer.Serialization}s that know how to create
+{@link org.apache.hadoop.io.serializer.Serializer}s and
+{@link org.apache.hadoop.io.serializer.Deserializer}s.
+
+
+
+To add a new serialization framework write an implementation of
+{@link org.apache.hadoop.io.serializer.Serialization} and add its name to the
+"io.serializations" property.
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ param, to the IPC server running at
+ address, returning the value. Throws exceptions if there are
+ network problems or if the remote code threw an exception.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Unwraps any IOException.
+
+ @param lookupTypes the desired exception class.
+ @return IOException, which is either the lookupClass exception or this.]]>
+
+
+
+
+ This unwraps any Throwable that has a constructor taking
+ a String as a parameter.
+ Otherwise it returns this.
+
+ @return Throwable]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ protocol is a Java interface. All parameters and return types must
+ be one of:
+
+
a primitive type, boolean, byte,
+ char, short, int, long,
+ float, double, or void; or
+
+
a {@link String}; or
+
+
a {@link Writable}; or
+
+
an array of the above types
+
+ All methods in the protocol should throw only IOException. No field data of
+ the protocol instance is transmitted.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ handlerCount determines
+ the number of handler threads that will be used to process calls.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class has a number of metrics variables that are publicly accessible;
+ these variables (objects) have methods to update their values;
+ for example:
+
{@link #rpcDiscardedOps}.inc(time)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ For the statistics that are sampled and averaged, one must specify
+ a metrics context that does periodic update calls. Most do.
+ The default Null metrics context however does NOT. So if you aren't
+ using any other metrics context then you can turn on the viewing and averaging
+ of sampled metrics by specifying the following two lines
+ in the hadoop-meterics.properties file:
+
+ Note that the metrics are collected regardless of the context used.
+ The context with the update thread is used to average the data periodically]]>
+
Grouphandles localization of the class name and the
+ counter names.
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FileInputFormat implementations can override this and return
+ false to ensure that individual input files are never split-up
+ so that {@link Mapper}s process entire files.
+
+ @param fs the file system that the file is on
+ @param filename the file name to check
+ @return is this file splitable?]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FileInputFormat is the base class for all file-based
+ InputFormats. This provides generic implementations of
+ {@link #validateInput(JobConf)} and {@link #getSplits(JobConf, int)}.
+ Implementations fo FileInputFormat can also override the
+ {@link #isSplitable(FileSystem, Path)} method to ensure input-files are
+ not split-up and are processed as a whole by {@link Mapper}s.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the job output should be compressed,
+ false otherwise]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Tasks' Side-Effect Files
+
+
Some applications need to create/write-to side-files, which differ from
+ the actual job-outputs.
+
+
In such cases there could be issues with 2 instances of the same TIP
+ (running simultaneously e.g. speculative tasks) trying to open/write-to the
+ same file (path) on HDFS. Hence the application-writer will have to pick
+ unique names per task-attempt (e.g. using the taskid, say
+ task_200709221812_0001_m_000000_0), not just per TIP.
+
+
To get around this the Map-Reduce framework helps the application-writer
+ out by maintaining a special
+ ${mapred.output.dir}/_temporary/_${taskid}
+ sub-directory for each task-attempt on HDFS where the output of the
+ task-attempt goes. On successful completion of the task-attempt the files
+ in the ${mapred.output.dir}/_temporary/_${taskid} (only)
+ are promoted to ${mapred.output.dir}. Of course, the
+ framework discards the sub-directory of unsuccessful task-attempts. This
+ is completely transparent to the application.
+
+
The application-writer can take advantage of this by creating any
+ side-files required in ${mapred.work.output.dir} during execution
+ of his reduce-task i.e. via {@link #getWorkOutputPath(JobConf)}, and the
+ framework will move them out similarly - thus she doesn't have to pick
+ unique paths per task-attempt.
+
+
Note: the value of ${mapred.work.output.dir} during
+ execution of a particular task-attempt is actually
+ ${mapred.output.dir}/_temporary/_{$taskid}, and this value is
+ set by the map-reduce framework. So, just create any side-files in the
+ path returned by {@link #getWorkOutputPath(JobConf)} from map/reduce
+ task to take advantage of this feature.
+
+
The entire discussion holds true for maps of jobs with
+ reducer=NONE (i.e. 0 reduces) since output of the map, in that case,
+ goes directly to HDFS.
+
+ @return the {@link Path} to the task's temporary output directory
+ for the map-reduce job.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This method is used to validate the input directories when a job is
+ submitted so that the {@link JobClient} can fail early, with an useful
+ error message, in case of errors. For e.g. input directory does not exist.
+
+
+ @param job job configuration.
+ @throws InvalidInputException if the job does not have valid input]]>
+
+
+
+
+
+
+
+ Each {@link InputSplit} is then assigned to an individual {@link Mapper}
+ for processing.
+
+
Note: The split is a logical split of the inputs and the
+ input files are not physically split into chunks. For e.g. a split could
+ be <input-file-path, start, offset> tuple.
+
+ @param job job configuration.
+ @param numSplits the desired number of splits, a hint.
+ @return an array of {@link InputSplit}s for the job.]]>
+
+
+
+
+
+
+
+
+ It is the responsibility of the RecordReader to respect
+ record boundaries while processing the logical split to present a
+ record-oriented view to the individual task.
+
+ @param split the {@link InputSplit}
+ @param job the job that this split belongs to
+ @return a {@link RecordReader}]]>
+
+
+
+ InputFormat describes the input-specification for a
+ Map-Reduce job.
+
+
The Map-Reduce framework relies on the InputFormat of the
+ job to:
+
+
+ Validate the input-specification of the job.
+
+ Split-up the input file(s) into logical {@link InputSplit}s, each of
+ which is then assigned to an individual {@link Mapper}.
+
+
+ Provide the {@link RecordReader} implementation to be used to glean
+ input records from the logical InputSplit for processing by
+ the {@link Mapper}.
+
+
+
+
The default behavior of file-based {@link InputFormat}s, typically
+ sub-classes of {@link FileInputFormat}, is to split the
+ input into logical {@link InputSplit}s based on the total size, in
+ bytes, of the input files. However, the {@link FileSystem} blocksize of
+ the input files is treated as an upper bound for input splits. A lower bound
+ on the split size can be set via
+
+ mapred.min.split.size.
+
+
Clearly, logical splits based on input-size is insufficient for many
+ applications since record boundaries are to respected. In such cases, the
+ application has to also implement a {@link RecordReader} on whom lies the
+ responsibilty to respect record-boundaries and present a record-oriented
+ view of the logical InputSplit to the individual task.
+
+ @see InputSplit
+ @see RecordReader
+ @see JobClient
+ @see FileInputFormat]]>
+
+
+
+
+
+
+
+
+
+ InputSplit.
+
+ @return the number of bytes in the input split.
+ @throws IOException]]>
+
+
+
+
+
+ InputSplit is
+ located as an array of Strings.
+ @throws IOException]]>
+
+
+
+ InputSplit represents the data to be processed by an
+ individual {@link Mapper}.
+
+
Typically, it presents a byte-oriented view on the input and is the
+ responsibility of {@link RecordReader} of the job to process this and present
+ a record-oriented view.
+
+ @see InputFormat
+ @see RecordReader]]>
+
+ Checking the input and output specifications of the job.
+
+
+ Computing the {@link InputSplit}s for the job.
+
+
+ Setup the requisite accounting information for the {@link DistributedCache}
+ of the job, if necessary.
+
+
+ Copying the job's jar and configuration to the map-reduce system directory
+ on the distributed file-system.
+
+
+ Submitting the job to the JobTracker and optionally monitoring
+ it's status.
+
+
+
+ Normally the user creates the application, describes various facets of the
+ job via {@link JobConf} and then uses the JobClient to submit
+ the job and monitor its progress.
+
+
Here is an example on how to use JobClient:
+
+ // Create a new JobConf
+ JobConf job = new JobConf(new Configuration(), MyJob.class);
+
+ // Specify various job-specific parameters
+ job.setJobName("myjob");
+
+ job.setInputPath(new Path("in"));
+ job.setOutputPath(new Path("out"));
+
+ job.setMapperClass(MyJob.MyMapper.class);
+ job.setReducerClass(MyJob.MyReducer.class);
+
+ // Submit the job, then poll for progress until the job is complete
+ JobClient.runJob(job);
+
+
+
Job Control
+
+
At times clients would chain map-reduce jobs to accomplish complex tasks
+ which cannot be done via a single map-reduce job. This is fairly easy since
+ the output of the job, typically, goes to distributed file-system and that
+ can be used as the input for the next job.
+
+
However, this also means that the onus on ensuring jobs are complete
+ (success/failure) lies squarely on the clients. In such situations the
+ various job-control options are:
+
+
+ {@link #runJob(JobConf)} : submits the job and returns only after
+ the job has completed.
+
+
+ {@link #submitJob(JobConf)} : only submits the job, then poll the
+ returned handle to the {@link RunningJob} to query status and make
+ scheduling decisions.
+
+
+ {@link JobConf#setJobEndNotificationURI(String)} : setup a notification
+ on job-completion, thus avoiding polling.
+
For key-value pairs (K1,V1) and (K2,V2), the values (V1, V2) are passed
+ in a single call to the reduce function if K1 and K2 compare as equal.
+
+
Since {@link #setOutputKeyComparatorClass(Class)} can be used to control
+ how keys are sorted, this can be used in conjunction to simulate
+ secondary sort on values.
+
+
Note: This is not a guarantee of the reduce sort being
+ stable in any sense. (In any case, with the order of available
+ map-outputs to the reduce being non-deterministic, it wouldn't make
+ that much sense.)
+
+ @param theClass the comparator class to be used for grouping keys.
+ It should implement RawComparator.
+ @see #setOutputKeyComparatorClass(Class)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ combiner class used to combine map-outputs
+ before being sent to the reducers. Typically the combiner is same as the
+ the {@link Reducer} for the job i.e. {@link #getReducerClass()}.
+
+ @return the user-defined combiner class used to combine map-outputs.]]>
+
+
+
+
+
+ combiner class used to combine map-outputs
+ before being sent to the reducers.
+
+
The combiner is a task-level aggregation operation which, in some cases,
+ helps to cut down the amount of data transferred from the {@link Mapper} to
+ the {@link Reducer}, leading to better performance.
+
+
Typically the combiner is same as the Reducer for the
+ job i.e. {@link #setReducerClass(Class)}.
+
+ @param theClass the user-defined combiner class used to combine
+ map-outputs.]]>
+
+
+
+
+ true.
+
+ @return true if speculative execution be used for this job,
+ false otherwise.]]>
+
+
+
+
+
+ true if speculative execution
+ should be turned on, else false.]]>
+
+
+
+
+ true.
+
+ @return true if speculative execution be
+ used for this job for map tasks,
+ false otherwise.]]>
+
+
+
+
+
+ true if speculative execution
+ should be turned on for map tasks,
+ else false.]]>
+
+
+
+
+ true.
+
+ @return true if speculative execution be used
+ for reduce tasks for this job,
+ false otherwise.]]>
+
+
+
+
+
+ true if speculative execution
+ should be turned on for reduce tasks,
+ else false.]]>
+
+
+
+
+ 1.
+
+ @return the number of reduce tasks for this job.]]>
+
+
+
+
+
+ Note: This is only a hint to the framework. The actual
+ number of spawned map tasks depends on the number of {@link InputSplit}s
+ generated by the job's {@link InputFormat#getSplits(JobConf, int)}.
+
+ A custom {@link InputFormat} is typically used to accurately control
+ the number of map tasks for the job.
+
+
How many maps?
+
+
The number of maps is usually driven by the total size of the inputs
+ i.e. total number of blocks of the input files.
+
+
The right level of parallelism for maps seems to be around 10-100 maps
+ per-node, although it has been set up to 300 or so for very cpu-light map
+ tasks. Task setup takes awhile, so it is best if the maps take at least a
+ minute to execute.
+
+
The default behavior of file-based {@link InputFormat}s is to split the
+ input into logical {@link InputSplit}s based on the total size, in
+ bytes, of input files. However, the {@link FileSystem} blocksize of the
+ input files is treated as an upper bound for input splits. A lower bound
+ on the split size can be set via
+
+ mapred.min.split.size.
+
+
Thus, if you expect 10TB of input data and have a blocksize of 128MB,
+ you'll end up with 82,000 maps, unless {@link #setNumMapTasks(int)} is
+ used to set it even higher.
+
+ @param n the number of map tasks for this job.
+ @see InputFormat#getSplits(JobConf, int)
+ @see FileInputFormat
+ @see FileSystem#getDefaultBlockSize()
+ @see FileStatus#getBlockSize()]]>
+
+
+
+
+ 1.
+
+ @return the number of reduce tasks for this job.]]>
+
+
+
+
+
+ How many reduces?
+
+
With 0.95 all of the reduces can launch immediately and
+ start transfering map outputs as the maps finish. With 1.75
+ the faster nodes will finish their first round of reduces and launch a
+ second wave of reduces doing a much better job of load balancing.
+
+
Increasing the number of reduces increases the framework overhead, but
+ increases load balancing and lowers the cost of failures.
+
+
The scaling factors above are slightly less than whole numbers to
+ reserve a few reduce slots in the framework for speculative-tasks, failures
+ etc.
+
+
Reducer NONE
+
+
It is legal to set the number of reduce-tasks to zero.
+
+
In this case the output of the map-tasks directly go to distributed
+ file-system, to the path set by
+ {@link FileOutputFormat#setOutputPath(JobConf, Path)}. Also, the
+ framework doesn't sort the map-outputs before writing it out to HDFS.
+
+ @param n the number of reduce tasks for this job.]]>
+
+
+
+
+ mapred.map.max.attempts
+ property. If this property is not already set, the default is 4 attempts.
+
+ @return the max number of attempts per map task.]]>
+
+
+
+
+
+
+
+
+
+
+ mapred.reduce.max.attempts
+ property. If this property is not already set, the default is 4 attempts.
+
+ @return the max number of attempts per reduce task.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ noFailures, the
+ tasktracker is blacklisted for this job.
+
+ @param noFailures maximum no. of failures of a given job per tasktracker.]]>
+
+
+
+
+ blacklisted for this job.
+
+ @return the maximum no. of failures of a given job per tasktracker.]]>
+
+
+
+
+ failed.
+
+ Defaults to zero, i.e. any failed map-task results in
+ the job being declared as {@link JobStatus#FAILED}.
+
+ @return the maximum percentage of map tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+
+ failed.
+
+ @param percent the maximum percentage of map tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+ failed.
+
+ Defaults to zero, i.e. any failed reduce-task results
+ in the job being declared as {@link JobStatus#FAILED}.
+
+ @return the maximum percentage of reduce tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+
+ failed.
+
+ @param percent the maximum percentage of reduce tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The debug script can aid debugging of failed map tasks. The script is
+ given task's stdout, stderr, syslog, jobconf files as arguments.
+
+
The debug command, run on the node where the map failed, is:
+
+ $script $stdout $stderr $syslog $jobconf.
+
+
+
The script file is distributed through {@link DistributedCache}
+ APIs. The script needs to be symlinked.
+
+ @param mDbgScript the script name]]>
+
+
+
+
+
+
+
+
+
+
+ The debug script can aid debugging of failed reduce tasks. The script
+ is given task's stdout, stderr, syslog, jobconf files as arguments.
+
+
The debug command, run on the node where the map failed, is:
+
+ $script $stdout $stderr $syslog $jobconf.
+
+
+
The script file is distributed through {@link DistributedCache}
+ APIs. The script file needs to be symlinked
+
+ @param rDbgScript the script name]]>
+
+
+
+
+
+
+
+
+
+ null if it hasn't
+ been set.
+ @see #setJobEndNotificationURI(String)]]>
+
+
+
+
+
+ The uri can contain 2 special parameters: $jobId and
+ $jobStatus. Those, if present, are replaced by the job's
+ identifier and completion-status respectively.
+
+
This is typically used by application-writers to implement chaining of
+ Map-Reduce jobs in an asynchronous manner.
+
+ @param uri the job end notification uri
+ @see JobStatus
+ @see Job Completion and Chaining]]>
+
+
+
+
+
+ When a job starts, a shared directory is created at location
+
+ ${mapred.local.dir}/taskTracker/jobcache/$jobid/work/ .
+ This directory is exposed to the users through
+ job.local.dir .
+ So, the tasks can use this space
+ as scratch space and share files among them.
+ This value is available as System property also.
+
+ @return The localized job specific shared directory]]>
+
+
+
+ JobConf is the primary interface for a user to describe a
+ map-reduce job to the Hadoop framework for execution. The framework tries to
+ faithfully execute the job as-is described by JobConf, however:
+
+
+ Some configuration parameters might have been marked as
+
+ final by administrators and hence cannot be altered.
+
+
+ While some job parameters are straight-forward to set
+ (e.g. {@link #setNumReduceTasks(int)}), some parameters interact subtly
+ rest of the framework and/or job-configuration and is relatively more
+ complex for the user to control finely (e.g. {@link #setNumMapTasks(int)}).
+
+
+
+
JobConf typically specifies the {@link Mapper}, combiner
+ (if any), {@link Partitioner}, {@link Reducer}, {@link InputFormat} and
+ {@link OutputFormat} implementations to be used etc.
+
+
Optionally JobConf is used to specify other advanced facets
+ of the job such as Comparators to be used, files to be put in
+ the {@link DistributedCache}, whether or not intermediate and/or job outputs
+ are to be compressed (and how), debugability via user-provided scripts
+ ( {@link #setMapDebugScript(String)}/{@link #setReduceDebugScript(String)}),
+ for doing post-processing on task logs, task's stdout, stderr, syslog.
+ and etc.
+
+
Here is an example on how to configure a job via JobConf:
+
+ // Create a new JobConf
+ JobConf job = new JobConf(new Configuration(), MyJob.class);
+
+ // Specify various job-specific parameters
+ job.setJobName("myjob");
+
+ FileInputFormat.setInputPaths(job, new Path("in"));
+ FileOutputFormat.setOutputPath(job, new Path("out"));
+
+ job.setMapperClass(MyJob.MyMapper.class);
+ job.setCombinerClass(MyJob.MyReducer.class);
+ job.setReducerClass(MyJob.MyReducer.class);
+
+ job.setInputFormat(SequenceFileInputFormat.class);
+ job.setOutputFormat(SequenceFileOutputFormat.class);
+
Applications can use the {@link Reporter} provided to report progress
+ or just indicate that they are alive. In scenarios where the application
+ takes an insignificant amount of time to process individual key/value
+ pairs, this is crucial since the framework might assume that the task has
+ timed-out and kill that task. The other way of avoiding this is to set
+
+ mapred.task.timeout to a high-enough value (or even zero for no
+ time-outs).
+
+ @param key the input key.
+ @param value the input value.
+ @param output collects mapped keys and values.
+ @param reporter facility to report progress.]]>
+
+
+
+ Maps are the individual tasks which transform input records into a
+ intermediate records. The transformed intermediate records need not be of
+ the same type as the input records. A given input pair may map to zero or
+ many output pairs.
+
+
The Hadoop Map-Reduce framework spawns one map task for each
+ {@link InputSplit} generated by the {@link InputFormat} for the job.
+ Mapper implementations can access the {@link JobConf} for the
+ job via the {@link JobConfigurable#configure(JobConf)} and initialize
+ themselves. Similarly they can use the {@link Closeable#close()} method for
+ de-initialization.
+
+
The framework then calls
+ {@link #map(Object, Object, OutputCollector, Reporter)}
+ for each key/value pair in the InputSplit for that task.
+
+
All intermediate values associated with a given output key are
+ subsequently grouped by the framework, and passed to a {@link Reducer} to
+ determine the final output. Users can control the grouping by specifying
+ a Comparator via
+ {@link JobConf#setOutputKeyComparatorClass(Class)}.
+
+
The grouped Mapper outputs are partitioned per
+ Reducer. Users can control which keys (and hence records) go to
+ which Reducer by implementing a custom {@link Partitioner}.
+
+
Users can optionally specify a combiner, via
+ {@link JobConf#setCombinerClass(Class)}, to perform local aggregation of the
+ intermediate outputs, which helps to cut down the amount of data transferred
+ from the Mapper to the Reducer.
+
+
The intermediate, grouped outputs are always stored in
+ {@link SequenceFile}s. Applications can specify if and how the intermediate
+ outputs are to be compressed and which {@link CompressionCodec}s are to be
+ used via the JobConf.
+
+
If the job has
+ zero
+ reduces then the output of the Mapper is directly written
+ to the {@link FileSystem} without grouping by keys.
+
+
Example:
+
+ public class MyMapper<K extends WritableComparable, V extends Writable>
+ extends MapReduceBase implements Mapper<K, V, K, V> {
+
+ static enum MyCounters { NUM_RECORDS }
+
+ private String mapTaskId;
+ private String inputFile;
+ private int noRecords = 0;
+
+ public void configure(JobConf job) {
+ mapTaskId = job.get("mapred.task.id");
+ inputFile = job.get("mapred.input.file");
+ }
+
+ public void map(K key, V val,
+ OutputCollector<K, V> output, Reporter reporter)
+ throws IOException {
+ // Process the <key, value> pair (assume this takes a while)
+ // ...
+ // ...
+
+ // Let the framework know that we are alive, and kicking!
+ // reporter.progress();
+
+ // Process some more
+ // ...
+ // ...
+
+ // Increment the no. of <key, value> pairs processed
+ ++noRecords;
+
+ // Increment counters
+ reporter.incrCounter(NUM_RECORDS, 1);
+
+ // Every 100 records update application-level status
+ if ((noRecords%100) == 0) {
+ reporter.setStatus(mapTaskId + " processed " + noRecords +
+ " from input-file: " + inputFile);
+ }
+
+ // Output the result
+ output.collect(key, val);
+ }
+ }
+
+
+
Applications may write a custom {@link MapRunnable} to exert greater
+ control on map processing e.g. multi-threaded Mappers etc.
Mapping of input records to output records is complete when this method
+ returns.
+
+ @param input the {@link RecordReader} to read the input records.
+ @param output the {@link OutputCollector} to collect the outputrecords.
+ @param reporter {@link Reporter} to report progress, status-updates etc.
+ @throws IOException]]>
+
+
+
+ Custom implementations of MapRunnable can exert greater
+ control on map processing e.g. multi-threaded, asynchronous mappers etc.
+
+ @see Mapper]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ nearly
+ equal content length.
+ Subclasses implement {@link #getRecordReader(InputSplit, JobConf, Reporter)}
+ to construct RecordReader's for MultiFileSplit's.
+ @see MultiFileSplit]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ th Path]]>
+
+
+
+
+
+
+
+
+
+
+ th Path]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ MultiFileSplit can be used to implement {@link RecordReader}'s, with
+ reading one record per file.
+ @see FileSplit
+ @see MultiFileInputFormat]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ <key, value> pairs output by {@link Mapper}s
+ and {@link Reducer}s.
+
+
OutputCollector is the generalization of the facility
+ provided by the Map-Reduce framework to collect data output by either the
+ Mapper or the Reducer i.e. intermediate outputs
+ or the output of the job.
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This is to validate the output specification for the job when it is
+ a job is submitted. Typically checks that it does not already exist,
+ throwing an exception when it already exists, so that output is not
+ overwritten.
+
+ @param ignored
+ @param job job configuration.
+ @throws IOException when output should not be attempted]]>
+
+
+
+ OutputFormat describes the output-specification for a
+ Map-Reduce job.
+
+
The Map-Reduce framework relies on the OutputFormat of the
+ job to:
+
+
+ Validate the output-specification of the job. For e.g. check that the
+ output directory doesn't already exist.
+
+ Provide the {@link RecordWriter} implementation to be used to write out
+ the output files of the job. Output files are stored in a
+ {@link FileSystem}.
+
+
+
+ @see RecordWriter
+ @see JobConf]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the job output should be compressed,
+ false otherwise]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Typically a hash function on a all or a subset of the key.
+
+ @param key the key to be paritioned.
+ @param value the entry value.
+ @param numPartitions the total number of partitions.
+ @return the partition number for the key.]]>
+
+
+
+ Partitioner controls the partitioning of the keys of the
+ intermediate map-outputs. The key (or a subset of the key) is used to derive
+ the partition, typically by a hash function. The total number of partitions
+ is the same as the number of reduce tasks for the job. Hence this controls
+ which of the m reduce tasks the intermediate key (and hence the
+ record) is sent for reduction.
+
+ @see Reducer]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ 0.0 to 1.0.
+ @throws IOException]]>
+
+
+
+ RecordReader reads <key, value> pairs from an
+ {@link InputSplit}.
+
+
RecordReader, typically, converts the byte-oriented view of
+ the input, provided by the InputSplit, and presents a
+ record-oriented view for the {@link Mapper} & {@link Reducer} tasks for
+ processing. It thus assumes the responsibility of processing record
+ boundaries and presenting the tasks with keys and values.
RecordWriter implementations write the job outputs to the
+ {@link FileSystem}.
+
+ @see OutputFormat]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Reduces values for a given key.
+
+
The framework calls this method for each
+ <key, (list of values)> pair in the grouped inputs.
+ Output values must be of the same type as input values. Input keys must
+ not be altered. Typically all values are combined into zero or one value.
+
+
+
Output pairs are collected with calls to
+ {@link OutputCollector#collect(Object,Object)}.
+
+
Applications can use the {@link Reporter} provided to report progress
+ or just indicate that they are alive. In scenarios where the application
+ takes an insignificant amount of time to process individual key/value
+ pairs, this is crucial since the framework might assume that the task has
+ timed-out and kill that task. The other way of avoiding this is to set
+
+ mapred.task.timeout to a high-enough value (or even zero for no
+ time-outs).
+
+ @param key the key.
+ @param values the list of values to reduce.
+ @param output to collect keys and combined values.
+ @param reporter facility to report progress.]]>
+
+
+
+ The number of Reducers for the job is set by the user via
+ {@link JobConf#setNumReduceTasks(int)}. Reducer implementations
+ can access the {@link JobConf} for the job via the
+ {@link JobConfigurable#configure(JobConf)} method and initialize themselves.
+ Similarly they can use the {@link Closeable#close()} method for
+ de-initialization.
+
+
Reducer has 3 primary phases:
+
+
+
+
Shuffle
+
+
Reducer is input the grouped output of a {@link Mapper}.
+ In the phase the framework, for each Reducer, fetches the
+ relevant partition of the output of all the Mappers, via HTTP.
+
+
+
+
+
Sort
+
+
The framework groups Reducer inputs by keys
+ (since different Mappers may have output the same key) in this
+ stage.
+
+
The shuffle and sort phases occur simultaneously i.e. while outputs are
+ being fetched they are merged.
+
+
SecondarySort
+
+
If equivalence rules for keys while grouping the intermediates are
+ different from those for grouping keys before reduction, then one may
+ specify a Comparator via
+ {@link JobConf#setOutputValueGroupingComparator(Class)}.Since
+ {@link JobConf#setOutputKeyComparatorClass(Class)} can be used to
+ control how intermediate keys are grouped, these can be used in conjunction
+ to simulate secondary sort on values.
+
+
+ For example, say that you want to find duplicate web pages and tag them
+ all with the url of the "best" known example. You would set up the job
+ like:
+
+
Map Input Key: url
+
Map Input Value: document
+
Map Output Key: document checksum, url pagerank
+
Map Output Value: url
+
Partitioner: by checksum
+
OutputKeyComparator: by checksum and then decreasing pagerank
+
OutputValueGroupingComparator: by checksum
+
+
+
+
+
Reduce
+
+
In this phase the
+ {@link #reduce(Object, Iterator, OutputCollector, Reporter)}
+ method is called for each <key, (list of values)> pair in
+ the grouped inputs.
+
The output of the reduce task is typically written to the
+ {@link FileSystem} via
+ {@link OutputCollector#collect(Object, Object)}.
+
+
+
+
The output of the Reducer is not re-sorted.
+
+
Example:
+
+ public class MyReducer<K extends WritableComparable, V extends Writable>
+ extends MapReduceBase implements Reducer<K, V, K, V> {
+
+ static enum MyCounters { NUM_RECORDS }
+
+ private String reduceTaskId;
+ private int noKeys = 0;
+
+ public void configure(JobConf job) {
+ reduceTaskId = job.get("mapred.task.id");
+ }
+
+ public void reduce(K key, Iterator<V> values,
+ OutputCollector<K, V> output,
+ Reporter reporter)
+ throws IOException {
+
+ // Process
+ int noValues = 0;
+ while (values.hasNext()) {
+ V value = values.next();
+
+ // Increment the no. of values for this key
+ ++noValues;
+
+ // Process the <key, value> pair (assume this takes a while)
+ // ...
+ // ...
+
+ // Let the framework know that we are alive, and kicking!
+ if ((noValues%10) == 0) {
+ reporter.progress();
+ }
+
+ // Process some more
+ // ...
+ // ...
+
+ // Output the <key, value>
+ output.collect(key, value);
+ }
+
+ // Increment the no. of <key, list of values> pairs processed
+ ++noKeys;
+
+ // Increment counters
+ reporter.incrCounter(NUM_RECORDS, 1);
+
+ // Every 100 keys update application-level status
+ if ((noKeys%100) == 0) {
+ reporter.setStatus(reduceTaskId + " processed " + noKeys);
+ }
+ }
+ }
+
+
+ @see Mapper
+ @see Partitioner
+ @see Reporter
+ @see MapReduceBase]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Enum.
+ @param amount A non-negative amount by which the counter is to
+ be incremented.]]>
+
+
+
+
+
+ InputSplit that the map is reading from.
+ @throws UnsupportedOperationException if called outside a mapper]]>
+
+
+
+
+
+
+
+
+ {@link Mapper} and {@link Reducer} can use the Reporter
+ provided to report progress or just indicate that they are alive. In
+ scenarios where the application takes an insignificant amount of time to
+ process individual key/value pairs, this is crucial since the framework
+ might assume that the task has timed-out and kill that task.
+
+
Applications can also update {@link Counters} via the provided
+ Reporter .
+
+ @see Progressable
+ @see Counters]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ progress of the job's map-tasks, as a float between 0.0
+ and 1.0. When all map tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's map-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ progress of the job's reduce-tasks, as a float between 0.0
+ and 1.0. When all reduce tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's reduce-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ true if the job is complete, else false.
+ @throws IOException]]>
+
+
+
+
+
+ true if the job succeeded, else false.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ RunningJob is the user-interface to query for details on a
+ running Map-Reduce job.
+
+
Clients can get hold of RunningJob via the {@link JobClient}
+ and then query the running-job for details such as name, configuration,
+ progress etc.
A Map-Reduce job usually splits the input data-set into independent
+chunks which processed by map tasks in completely parallel manner,
+followed by reduce tasks which aggregating their output. Typically both
+the input and the output of the job are stored in a
+{@link org.apache.hadoop.fs.FileSystem}. The framework takes care of monitoring
+tasks and re-executing failed ones. Since, usually, the compute nodes and the
+storage nodes are the same i.e. Hadoop's Map-Reduce framework and Distributed
+FileSystem are running on the same set of nodes, tasks are effectively scheduled
+on the nodes where data is already present, resulting in very high aggregate
+bandwidth across the cluster.
+
+
The Map-Reduce framework operates exclusively on <key, value>
+pairs i.e. the input to the job is viewed as a set of <key, value>
+pairs and the output as another, possibly different, set of
+<key, value> pairs. The keys and values have to
+be serializable as {@link org.apache.hadoop.io.Writable}s and additionally the
+keys have to be {@link org.apache.hadoop.io.WritableComparable}s in
+order to facilitate grouping by the framework.
+
+
Data flow:
+
+ (input)
+ <k1, v1>
+
+ |
+ V
+
+ map
+
+ |
+ V
+
+ <k2, v2>
+
+ |
+ V
+
+ combine
+
+ |
+ V
+
+ <k2, v2>
+
+ |
+ V
+
+ reduce
+
+ |
+ V
+
+ <k3, v3>
+ (output)
+
+
+
Applications typically implement
+{@link org.apache.hadoop.mapred.Mapper#map(Object, Object, OutputCollector, Reporter)}
+and
+{@link org.apache.hadoop.mapred.Reducer#reduce(Object, Iterator, OutputCollector, Reporter)}
+methods. The application-writer also specifies various facets of the job such
+as input and output locations, the Partitioner, InputFormat
+& OutputFormat implementations to be used etc. as
+a {@link org.apache.hadoop.mapred.JobConf}. The client program,
+{@link org.apache.hadoop.mapred.JobClient}, then submits the job to the framework
+and optionally monitors it.
+
+
The framework spawns one map task per
+{@link org.apache.hadoop.mapred.InputSplit} generated by the
+{@link org.apache.hadoop.mapred.InputFormat} of the job and calls
+{@link org.apache.hadoop.mapred.Mapper#map(Object, Object, OutputCollector, Reporter)}
+with each <key, value> pair read by the
+{@link org.apache.hadoop.mapred.RecordReader} from the InputSplit for
+the task. The intermediate outputs of the maps are then grouped by keys
+and optionally aggregated by combiner. The key space of intermediate
+outputs are paritioned by the {@link org.apache.hadoop.mapred.Partitioner}, where
+the number of partitions is exactly the number of reduce tasks for the job.
+
+
The reduce tasks fetch the sorted intermediate outputs of the maps, via http,
+merge the <key, value> pairs and call
+{@link org.apache.hadoop.mapred.Reducer#reduce(Object, Iterator, OutputCollector, Reporter)}
+for each <key, list of values> pair. The output of the reduce tasks' is
+stored on the FileSystem by the
+{@link org.apache.hadoop.mapred.RecordWriter} provided by the
+{@link org.apache.hadoop.mapred.OutputFormat} of the job.
+
+
Map-Reduce application to perform a distributed grep:
+
+public class Grep extends Configured implements Tool {
+
+ // map: Search for the pattern specified by 'grep.mapper.regex' &
+ // 'grep.mapper.regex.group'
+
+ class GrepMapper<K, Text>
+ extends MapReduceBase implements Mapper<K, Text, Text, LongWritable> {
+
+ private Pattern pattern;
+ private int group;
+
+ public void configure(JobConf job) {
+ pattern = Pattern.compile(job.get("grep.mapper.regex"));
+ group = job.getInt("grep.mapper.regex.group", 0);
+ }
+
+ public void map(K key, Text value,
+ OutputCollector<Text, LongWritable> output,
+ Reporter reporter)
+ throws IOException {
+ String text = value.toString();
+ Matcher matcher = pattern.matcher(text);
+ while (matcher.find()) {
+ output.collect(new Text(matcher.group(group)), new LongWritable(1));
+ }
+ }
+ }
+
+ // reduce: Count the number of occurrences of the pattern
+
+ class GrepReducer<K> extends MapReduceBase
+ implements Reducer<K, LongWritable, K, LongWritable> {
+
+ public void reduce(K key, Iterator<LongWritable> values,
+ OutputCollector<K, LongWritable> output,
+ Reporter reporter)
+ throws IOException {
+
+ // sum all values for this key
+ long sum = 0;
+ while (values.hasNext()) {
+ sum += values.next().get();
+ }
+
+ // output sum
+ output.collect(key, new LongWritable(sum));
+ }
+ }
+
+ public int run(String[] args) throws Exception {
+ if (args.length < 3) {
+ System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
+ ToolRunner.printGenericCommandUsage(System.out);
+ return -1;
+ }
+
+ JobConf grepJob = new JobConf(getConf(), Grep.class);
+
+ grepJob.setJobName("grep");
+
+ grepJob.setInputPath(new Path(args[0]));
+ grepJob.setOutputPath(args[1]);
+
+ grepJob.setMapperClass(GrepMapper.class);
+ grepJob.setCombinerClass(GrepReducer.class);
+ grepJob.setReducerClass(GrepReducer.class);
+
+ grepJob.set("mapred.mapper.regex", args[2]);
+ if (args.length == 4)
+ grepJob.set("mapred.mapper.regex.group", args[3]);
+
+ grepJob.setOutputFormat(SequenceFileOutputFormat.class);
+ grepJob.setOutputKeyClass(Text.class);
+ grepJob.setOutputValueClass(LongWritable.class);
+
+ JobClient.runJob(grepJob);
+
+ return 0;
+ }
+
+ public static void main(String[] args) throws Exception {
+ int res = ToolRunner.run(new Configuration(), new Grep(), args);
+ System.exit(res);
+ }
+
+}
+
+
+
Notice how the data-flow of the above grep job is very similar to doing the
+same via the unix pipeline:
+
+
+cat input/* | grep | sort | uniq -c > out
+
+
+
+ input | map | shuffle | reduce > out
+
+
+
Hadoop Map-Reduce applications need not be written in
+JavaTM only.
+Hadoop Streaming is a utility
+which allows users to create and run jobs with any executables (e.g. shell
+utilities) as the mapper and/or the reducer.
+Hadoop Pipes is a
+SWIG-compatible C++ API to implement
+Map-Reduce applications (non JNITM based).
Operations included in this patch are partitioned into one of two types:
+join operations emitting tuples and "multi-filter" operations emitting a
+single value from (but not necessarily included in) a set of input values.
+For a given key, each operation will consider the cross product of all
+values for all sources at that node.
+
+
Identifiers supported by default:
+
+
+
identifier
type
description
+
inner
Join
Full inner join
+
outer
Join
Full outer join
+
override
MultiFilter
+
For a given key, prefer values from the rightmost source
+
+
+
A user of this class must set the InputFormat for the job to
+CompositeInputFormat and define a join expression accepted by the
+preceding grammar. For example, both of the following are acceptable:
CompositeInputFormat includes a handful of convenience methods to
+aid construction of these verbose statements.
+
+
As in the second example, joins may be nested. Users may provide a
+comparator class in the mapred.join.keycomparator property to specify
+the ordering of their keys, or accept the default comparator as returned by
+WritableComparator.get(keyclass).
+
+
Users can specify their own join operations, typically by overriding
+JoinRecordReader or MultiFilterRecordReader and mapping that
+class to an identifier in the join expression using the
+mapred.join.define.ident property, where ident is
+the identifier appearing in the join expression. Users may elect to emit- or
+modify- values passing through their join operation. Consulting the existing
+operations for guidance is recommended. Adding arguments is considerably more
+complex (and only partially supported), as one must also add a Node
+type to the parse tree. One is probably better off extending
+RecordReader in most cases.
+ Map implementations using this MapRunnable must be thread-safe.
+
+ The Map-Reduce job has to be configured to use this MapRunnable class (using
+ the JobConf.setMapRunnerClass method) and
+ the number of thread the thread-pool can use with the
+ mapred.map.multithreadedrunner.threads property, its default
+ value is 10 threads.
+
+
+Generally speaking, in order to implement an application using Map/Reduce
+model, the developer needs to implement Map and Reduce functions (and possibly
+Combine function). However, for a lot of applications related to counting and
+statistics computing, these functions have very similar
+characteristics. This provides a package implementing
+those patterns. In particular, the package provides a generic mapper class,
+a reducer class and a combiner class, and a set of built-in value aggregators.
+It also provides a generic utility class, ValueAggregatorJob, that offers a static function that
+creates map/reduce jobs:
+
+To call this function, the user needs to pass in arguments specifying the input directories, the output directory,
+the number of reducers, the input data format (textinputformat or sequencefileinputformat), and a file specifying user plugin class(es) to load by the mapper.
+A user plugin class is responsible for specifying what
+aggregators to use and what values are for which aggregators.
+A plugin class must implement the following interface:
+
+
+ public interface ValueAggregatorDescriptor {
+ public ArrayList<Entry> generateKeyValPairs(Object key, Object value);
+ public void configure(JobConfjob);
+}
+
+
+Function generateKeyValPairs will generate aggregation key/value pairs for the
+input key/value pair. Each aggregation key encodes two pieces of information: the aggregation type and aggregation ID.
+The value is the value to be aggregated onto the aggregation ID according to the aggregation type. Here
+is a simple example user plugin class for counting the words in the input texts:
+
+
+public class WordCountAggregatorDescriptor extends ValueAggregatorBaseDescriptor {
+ public ArrayList<Entry> generateKeyValPairs(Object key, Object val) {
+ String words [] = val.toString().split(" |\t");
+ ArrayList<Entry> retv = new ArrayList<Entry>();
+ for (int i = 0; i < words.length; i++) {
+ retv.add(generateEntry(LONG_VALUE_SUM, words[i], ONE))
+ }
+ return retv;
+ }
+ public void configure(JobConf job) {}
+}
+
+
+In the above code, LONG_VALUE_SUM is a string denoting the aggregation type LongValueSum, which sums over long values.
+ONE denotes a string "1". Function generateEntry(LONG_VALUE_SUM, words[i], ONE) will inperpret the first argument as an aggregation type, the second as an aggregation ID, and the third argumnent as the value to be aggregated. The output will look like: "LongValueSum:xxxx", where XXXX is the string value of words[i]. The value will be "1". The mapper will call generateKeyValPairs(Object key, Object val) for each input key/value pair to generate the desired aggregation id/value pairs.
+The down stream combiner/reducer will interpret these pairs as adding one to the aggregator XXXX.
+
+Class ValueAggregatorBaseDescriptor is a base class that user plugin classes can extend. Here is the XML fragment specifying the user plugin class:
+
+Thus, if no user plugin class is specified, the default behavior of the map/reduce job is to count the number of records (lines) in the imput files.
+
+During runtime, the mapper will invoke the generateKeyValPairs function for each input key/value pair, and emit the generated
+key/value pairs:
+
+The reducer will create an aggregator object for each key/value list pair, and perform the appropriate aggregation.
+At the end, it will emit the aggregator's results:
+
+In order to be able to use combiner, all the aggregation type be aggregators must be associative and communitive.
+The following are the types supported:
+
LongValueSum: sum over long values
+
DoubleValueSum: sum over float/double values
+
uniqValueCount: count the number of distinct values
+
ValueHistogram: compute the histogram of values compute the minimum, maximum, media,average, standard deviation of numeric values
+
+
+
Create and run an application
+
+To create an application, the user needs to do the following things:
+
+1. Implement a user plugin:
+
+
+The application programs link against a thin C++ wrapper library that
+handles the communication with the rest of the Hadoop system. The C++
+interface is "swigable" so that interfaces can be generated for python
+and other scripting languages. All of the C++ functions and classes
+are in the HadoopPipes namespace. The job may consist of any
+combination of Java and C++ RecordReaders, Mappers, Paritioner,
+Combiner, Reducer, and RecordWriter.
+
+
+
+Hadoop Pipes has a generic Java class for handling the mapper and
+reducer (PipesMapRunner and PipesReducer). They fork off the
+application program and communicate with it over a socket. The
+communication is handled by the C++ wrapper library and the
+PipesMapRunner and PipesReducer.
+
+
+
+The application program passes in a factory object that can create
+the various objects needed by the framework to the runTask
+function. The framework creates the Mapper or Reducer as
+appropriate and calls the map or reduce method to invoke the
+application's code. The JobConf is available to the application.
+
+
+
+The Mapper and Reducer objects get all of their inputs, outputs, and
+context via context objects. The advantage of using the context
+objects is that their interface can be extended with additional
+methods without breaking clients. Although this interface is different
+from the current Java interface, the plan is to migrate the Java
+interface in this direction.
+
+
+
+Although the Java implementation is typed, the C++ interfaces of keys
+and values is just a byte buffer. Since STL strings provide precisely
+the right functionality and are standard, they will be used. The
+decision to not use stronger types was to simplify the interface.
+
+
+
+The application can also define combiner functions. The combiner will
+be run locally by the framework in the application process to avoid
+the round trip to the Java process and back. Because the compare
+function is not available in C++, the combiner will use memcmp to
+sort the inputs to the combiner. This is not as general as the Java
+equivalent, which uses the user's comparator, but should cover the
+majority of the use cases. As the map function outputs key/value
+pairs, they will be buffered. When the buffer is full, it will be
+sorted and passed to the combiner. The output of the combiner will be
+sent to the Java process.
+
+
+
+The application can also set a partition function to control which key
+is given to a particular reduce. If a partition function is not
+defined, the Java one will be used. The partition function will be
+called by the C++ framework before the key/value pair is sent back to
+Java.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ When constructing the instance, if the factory property
+ contextName.class exists,
+ its value is taken to be the name of the class to instantiate. Otherwise,
+ the default is to create an instance of
+ org.apache.hadoop.metrics.spi.NullContext, which is a
+ dummy "no-op" context which will cause all metric data to be discarded.
+
+ @param contextName the name of the context
+ @return the named MetricsContext]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ When the instance is constructed, this method checks if the file
+ hadoop-metrics.properties exists on the class path. If it
+ exists, it must be in the format defined by java.util.Properties, and all
+ the properties in the file are set as attributes on the newly created
+ ContextFactory instance.
+
+ @return the singleton ContextFactory instance]]>
+
+
+
+ getFactory() method.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ startMonitoring() again after calling
+ this.
+ @see #close()]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ recordName.
+ Throws an exception if the metrics implementation is configured with a fixed
+ set of record names and recordName is not in that set.
+
+ @param recordName the name of the record
+ @throws MetricsException if recordName conflicts with configuration data]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A record name identifies the kind of data to be reported. For example, a
+ program reporting statistics relating to the disks on a computer might use
+ a record name "diskStats".
+
+ A record has zero or more tags. A tag has a name and a value. To
+ continue the example, the "diskStats" record might use a tag named
+ "diskName" to identify a particular disk. Sometimes it is useful to have
+ more than one tag, so there might also be a "diskType" with value "ide" or
+ "scsi" or whatever.
+
+ A record also has zero or more metrics. These are the named
+ values that are to be reported to the metrics system. In the "diskStats"
+ example, possible metric names would be "diskPercentFull", "diskPercentBusy",
+ "kbReadPerSecond", etc.
+
+ The general procedure for using a MetricsRecord is to fill in its tag and
+ metric values, and then call update() to pass the record to the
+ client library.
+ Metric data is not immediately sent to the metrics system
+ each time that update() is called.
+ An internal table is maintained, identified by the record name. This
+ table has columns
+ corresponding to the tag and the metric names, and rows
+ corresponding to each unique set of tag values. An update
+ either modifies an existing row in the table, or adds a new row with a set of
+ tag values that are different from all the other rows. Note that if there
+ are no tags, then there can be at most one row in the table.
+
+ Once a row is added to the table, its data will be sent to the metrics system
+ on every timer period, whether or not it has been updated since the previous
+ timer period. If this is inappropriate, for example if metrics were being
+ reported by some transient object in an application, the remove()
+ method can be used to remove the row and thus stop the data from being
+ sent.
+
+ Note that the update() method is atomic. This means that it is
+ safe for different threads to be updating the same metric. More precisely,
+ it is OK for different threads to call update() on MetricsRecord instances
+ with the same set of tag names and tag values. Different threads should
+ not use the same MetricsRecord instance at the same time.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ MetricsContext.registerUpdater().]]>
+
+
+
+
+
+The API is abstract so that it can be implemented on top of
+a variety of metrics client libraries. The choice of
+client library is a configuration option, and different
+modules within the same application can use
+different metrics implementation libraries.
+
+Sub-packages:
+
+
org.apache.hadoop.metrics.spi
+
The abstract Server Provider Interface package. Those wishing to
+ integrate the metrics API with a particular metrics client library should
+ extend this package.
+
+
org.apache.hadoop.metrics.file
+
An implementation package which writes the metric data to
+ a file, or sends it to the standard output stream.
+
+
org.apache.hadoop.metrics.ganglia
+
An implementation package which sends metric data to
+ Ganglia.
+
+
+
Introduction to the Metrics API
+
+Here is a simple example of how to use this package to report a single
+metric value:
+
The context name will typically identify either the application, or else a
+ module within an application or library.
+
+
myRecord
+
The record name generally identifies some entity for which a set of
+ metrics are to be reported. For example, you could have a record named
+ "cacheStats" for reporting a number of statistics relating to the usage of
+ some cache in your application.
+
+
myMetric
+
This identifies a particular metric. For example, you might have metrics
+ named "cache_hits" and "cache_misses".
+
+
+
+
Tags
+
+In some cases it is useful to have multiple records with the same name. For
+example, suppose that you want to report statistics about each disk on a computer.
+In this case, the record name would be something like "diskStats", but you also
+need to identify the disk which is done by adding a tag to the record.
+The code could look something like this:
+
+
+Data is not sent immediately to the metrics system when
+MetricsRecord.update() is called. Instead it is stored in an
+internal table, and the contents of the table are sent periodically.
+This can be important for two reasons:
+
+
It means that a programmer is free to put calls to this API in an
+ inner loop, since updates can be very frequent without slowing down
+ the application significantly.
+
Some implementations can gain efficiency by combining many metrics
+ into a single UDP message.
+
+
+The API provides a timer-based callback via the
+registerUpdater() method. The benefit of this
+versus using java.util.Timer is that the callbacks will be done
+immediately before sending the data, making the data as current as possible.
+
+
Configuration
+
+It is possible to programmatically examine and modify configuration data
+before creating a context, like this:
+
+The factory attributes can be examined and modified using the following
+ContextFactorymethods:
+
+
Object getAttribute(String attributeName)
+
String[] getAttributeNames()
+
void setAttribute(String name, Object value)
+
void removeAttribute(attributeName)
+
+
+
+ContextFactory.getFactory() initializes the factory attributes by
+reading the properties file hadoop-metrics.properties if it exists
+on the class path.
+
+
+A factory attribute named:
+
+contextName.class
+
+should have as its value the fully qualified name of the class to be
+instantiated by a call of the CodeFactory method
+getContext(contextName). If this factory attribute is not
+specified, the default is to instantiate
+org.apache.hadoop.metrics.file.FileContext.
+
+
+Other factory attributes are specific to a particular implementation of this
+API and are documented elsewhere. For example, configuration attributes for
+the file and Ganglia implementations can be found in the javadoc for
+their respective packages.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ fileName attribute,
+ if specified. Otherwise the data will be written to standard
+ output.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class is configured by setting ContextFactory attributes which in turn
+ are usually configured through a properties file. All the attributes are
+ prefixed by the contextName. For example, the properties file might contain:
+
]]>
+
+
+
+
+
+These are the implementation specific factory attributes
+(See ContextFactory.getFactory()):
+
+
+
contextName.fileName
+
The path of the file to which metrics in context contextName
+ are to be appended. If this attribute is not specified, the metrics
+ are written to standard output by default.
+
+
contextName.period
+
The period in seconds on which the metric data is written to the
+ file.
+
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Implementation of the metrics package that sends metric data to
+Ganglia.
+Programmers should not normally need to use this package directly. Instead
+they should use org.hadoop.metrics.
+
+
+These are the implementation specific factory attributes
+(See ContextFactory.getFactory()):
+
+
+
contextName.servers
+
Space and/or comma separated sequence of servers to which UDP
+ messages should be sent.
+
+
contextName.period
+
The period in seconds on which the metric data is sent to the
+ server(s).
+
+
contextName.units.recordName.metricName
+
The units for the specified metric in the specified record.
+
+
contextName.slope.recordName.metricName
+
The slope for the specified metric in the specified record.
+
+
contextName.tmax.recordName.metricName
+
The tmax for the specified metric in the specified record.
+
+
contextName.dmax.recordName.metricName
+
The dmax for the specified metric in the specified record.
+
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ contextName.tableName. The returned map consists of
+ those attributes with the contextName and tableName stripped off.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ recordName.
+ Throws an exception if the metrics implementation is configured with a fixed
+ set of record names and recordName is not in that set.
+
+ @param recordName the name of the record
+ @throws MetricsException if recordName conflicts with configuration data]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class implements the internal table of metric data, and the timer
+ on which data is to be sent to the metrics system. Subclasses must
+ override the abstract emitRecord method in order to transmit
+ the data. ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ update
+ and remove().]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ hostname or hostname:port. If
+ the specs string is null, defaults to localhost:defaultPort.
+
+ @return a list of InetSocketAddress objects.]]>
+
+
+
+
+
+
+
+
+ org.apache.hadoop.metrics.file and
+org.apache.hadoop.metrics.ganglia.
+
+Plugging in an implementation involves writing a concrete subclass of
+AbstractMetricsContext. The subclass should get its
+ configuration information using the getAttribute(attributeName)
+ method.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ ,name="
+ Where the and are the supplied parameters
+
+ @param serviceName
+ @param nameName
+ @param theMbean - the MBean to register
+ @return the named used to register the MBean]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ hadoop.rpc.socket.factory.class.<ClassName>. When no
+ such parameter exists then fall back on the default socket factory as
+ configured by hadoop.rpc.socket.factory.class.default. If
+ this default socket factory is not configured, then fall back on the JVM
+ default socket factory.
+
+ @param conf the configuration
+ @param clazz the class (usually a {@link VersionedProtocol})
+ @return a socket factory]]>
+
+
+
+
+
+ hadoop.rpc.socket.factory.default
+
+ @param conf the configuration
+ @return the default socket factory as specified in the configuration or
+ the JVM default socket factory if the configuration does not
+ contain a default socket factory property.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ :
+ ://:/]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ From documentation for {@link #getInputStream(Socket, long)}:
+ Returns InputStream for the socket. If the socket has an associated
+ SocketChannel then it returns a
+ {@link SocketInputStream} with the given timeout. If the socket does not
+ have a channel, {@link Socket#getInputStream()} is returned. In the later
+ case, the timeout argument is ignored and the timeout set with
+ {@link Socket#setSoTimeout(int)} applies for reads.
+
+ Any socket created using socket factories returned by {@link #NetUtils},
+ must use this interface instead of {@link Socket#getInputStream()}.
+
+ @see #getInputStream(Socket, long)
+
+ @param socket
+ @return InputStream for reading from the socket.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+ Any socket created using socket factories returned by {@link #NetUtils},
+ must use this interface instead of {@link Socket#getInputStream()}.
+
+ @see Socket#getChannel()
+
+ @param socket
+ @param timeout timeout in milliseconds. This may not always apply. zero
+ for waiting as long as necessary.
+ @return InputStream for reading from the socket.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+ From documentation for {@link #getOutputStream(Socket, long)} :
+ Returns OutputStream for the socket. If the socket has an associated
+ SocketChannel then it returns a
+ {@link SocketOutputStream} with the given timeout. If the socket does not
+ have a channel, {@link Socket#getOutputStream()} is returned. In the later
+ case, the timeout argument is ignored and the write will wait until
+ data is available.
+
+ Any socket created using socket factories returned by {@link #NetUtils},
+ must use this interface instead of {@link Socket#getOutputStream()}.
+
+ @see #getOutputStream(Socket, long)
+
+ @param socket
+ @return OutputStream for writing to the socket.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+ Any socket created using socket factories returned by {@link #NetUtils},
+ must use this interface instead of {@link Socket#getOutputStream()}.
+
+ @see Socket#getChannel()
+
+ @param socket
+ @param timeout timeout in milliseconds. This may not always apply. zero
+ for waiting as long as necessary.
+ @return OutputStream for writing to the socket.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ node
+
+ @param node
+ a node
+ @return true if node is already in the tree; false otherwise]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ scope
+ if scope starts with ~, choose one from the all nodes except for the
+ ones in scope; otherwise, choose one from scope
+ @param scope range of nodes from which a node will be choosen
+ @return the choosen node]]>
+
+
+
+
+
+
+ scope but not in excludedNodes
+ if scope starts with ~, return the number of nodes that are not
+ in scope and excludedNodes;
+ @param scope a path string that may start with ~
+ @param excludedNodes a list of nodes
+ @return number of available nodes]]>
+
+
+
+
+
+
+
+
+
+
+
+ reader
+ It linearly scans the array, if a local node is found, swap it with
+ the first element of the array.
+ If a local rack node is found, swap it with the first element following
+ the local node.
+ If neither local node or local rack node is found, put a random replica
+ location at postion 0.
+ It leaves the rest nodes untouched.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Create a new input stream with the given timeout. If the timeout
+ is zero, it will be treated as infinite timeout. The socket's
+ channel will be configured to be non-blocking.
+
+ @see SocketInputStream#SocketInputStream(ReadableByteChannel, long)
+
+ @param socket should have a channel associated with it.
+ @param timeout timeout timeout in milliseconds. must not be negative.
+ @throws IOException]]>
+
+
+
+
+
+
+
+ Create a new input stream with the given timeout. If the timeout
+ is zero, it will be treated as infinite timeout. The socket's
+ channel will be configured to be non-blocking.
+ @see SocketInputStream#SocketInputStream(ReadableByteChannel, long)
+
+ @param socket should have a channel associated with it.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Create a new ouput stream with the given timeout. If the timeout
+ is zero, it will be treated as infinite timeout. The socket's
+ channel will be configured to be non-blocking.
+
+ @see SocketOutputStream#SocketOutputStream(WritableByteChannel, long)
+
+ @param socket should have a channel associated with it.
+ @param timeout timeout timeout in milliseconds. must not be negative.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ = getCount().
+ @param newCapacity The new capacity in bytes.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Index idx = startVector(...);
+ while (!idx.done()) {
+ .... // read element of a vector
+ idx.incr();
+ }
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Introduction
+
+ Software systems of any significant complexity require mechanisms for data
+interchange with the outside world. These interchanges typically involve the
+marshaling and unmarshaling of logical units of data to and from data streams
+(files, network connections, memory buffers etc.). Applications usually have
+some code for serializing and deserializing the data types that they manipulate
+embedded in them. The work of serialization has several features that make
+automatic code generation for it worthwhile. Given a particular output encoding
+(binary, XML, etc.), serialization of primitive types and simple compositions
+of primitives (structs, vectors etc.) is a very mechanical task. Manually
+written serialization code can be susceptible to bugs especially when records
+have a large number of fields or a record definition changes between software
+versions. Lastly, it can be very useful for applications written in different
+programming languages to be able to share and interchange data. This can be
+made a lot easier by describing the data records manipulated by these
+applications in a language agnostic manner and using the descriptions to derive
+implementations of serialization in multiple target languages.
+
+This document describes Hadoop Record I/O, a mechanism that is aimed
+at
+
+
enabling the specification of simple serializable data types (records)
+
enabling the generation of code in multiple target languages for
+marshaling and unmarshaling such types
+
providing target language specific support that will enable application
+programmers to incorporate generated code into their applications
+
+
+The goals of Hadoop Record I/O are similar to those of mechanisms such as XDR,
+ASN.1, PADS and ICE. While these systems all include a DDL that enables
+the specification of most record types, they differ widely in what else they
+focus on. The focus in Hadoop Record I/O is on data marshaling and
+multi-lingual support. We take a translator-based approach to serialization.
+Hadoop users have to describe their data in a simple data description
+language. The Hadoop DDL translator rcc generates code that users
+can invoke in order to read/write their data from/to simple stream
+abstractions. Next we list explicitly some of the goals and non-goals of
+Hadoop Record I/O.
+
+
+
Goals
+
+
+
Support for commonly used primitive types. Hadoop should include as
+primitives commonly used builtin types from programming languages we intend to
+support.
+
+
Support for common data compositions (including recursive compositions).
+Hadoop should support widely used composite types such as structs and
+vectors.
+
+
Code generation in multiple target languages. Hadoop should be capable of
+generating serialization code in multiple target languages and should be
+easily extensible to new target languages. The initial target languages are
+C++ and Java.
+
+
Support for generated target languages. Hadooop should include support
+in the form of headers, libraries, packages for supported target languages
+that enable easy inclusion and use of generated code in applications.
+
+
Support for multiple output encodings. Candidates include
+packed binary, comma-separated text, XML etc.
+
+
Support for specifying record types in a backwards/forwards compatible
+manner. This will probably be in the form of support for optional fields in
+records. This version of the document does not include a description of the
+planned mechanism, we intend to include it in the next iteration.
+
+
+
+
Non-Goals
+
+
+
Serializing existing arbitrary C++ classes.
+
Serializing complex data structures such as trees, linked lists etc.
+
Built-in indexing schemes, compression, or check-sums.
+
Dynamic construction of objects from an XML schema.
+
+
+The remainder of this document describes the features of Hadoop record I/O
+in more detail. Section 2 describes the data types supported by the system.
+Section 3 lays out the DDL syntax with some examples of simple records.
+Section 4 describes the process of code generation with rcc. Section 5
+describes target language mappings and support for Hadoop types. We include a
+fairly complete description of C++ mappings with intent to include Java and
+others in upcoming iterations of this document. The last section talks about
+supported output encodings.
+
+
+
Data Types and Streams
+
+This section describes the primitive and composite types supported by Hadoop.
+We aim to support a set of types that can be used to simply and efficiently
+express a wide range of record types in different programming languages.
+
+
Primitive Types
+
+For the most part, the primitive types of Hadoop map directly to primitive
+types in high level programming languages. Special cases are the
+ustring (a Unicode string) and buffer types, which we believe
+find wide use and which are usually implemented in library code and not
+available as language built-ins. Hadoop also supplies these via library code
+when a target language built-in is not present and there is no widely
+adopted "standard" implementation. The complete list of primitive types is:
+
+
+
byte: An 8-bit unsigned integer.
+
boolean: A boolean value.
+
int: A 32-bit signed integer.
+
long: A 64-bit signed integer.
+
float: A single precision floating point number as described by
+ IEEE-754.
+
double: A double precision floating point number as described by
+ IEEE-754.
+
ustring: A string consisting of Unicode characters.
+
buffer: An arbitrary sequence of bytes.
+
+
+
+
Composite Types
+Hadoop supports a small set of composite types that enable the description
+of simple aggregate types and containers. A composite type is serialized
+by sequentially serializing it constituent elements. The supported
+composite types are:
+
+
+
+
record: An aggregate type like a C-struct. This is a list of
+typed fields that are together considered a single unit of data. A record
+is serialized by sequentially serializing its constituent fields. In addition
+to serialization a record has comparison operations (equality and less-than)
+implemented for it, these are defined as memberwise comparisons.
+
+
vector: A sequence of entries of the same data type, primitive
+or composite.
+
+
map: An associative container mapping instances of a key type to
+instances of a value type. The key and value types may themselves be primitive
+or composite types.
+
+
+
+
Streams
+
+Hadoop generates code for serializing and deserializing record types to
+abstract streams. For each target language Hadoop defines very simple input
+and output stream interfaces. Application writers can usually develop
+concrete implementations of these by putting a one method wrapper around
+an existing stream implementation.
+
+
+
DDL Syntax and Examples
+
+We now describe the syntax of the Hadoop data description language. This is
+followed by a few examples of DDL usage.
+
+
+
+A DDL file describes one or more record types. It begins with zero or
+more include declarations, a single mandatory module declaration
+followed by zero or more class declarations. The semantics of each of
+these declarations are described below:
+
+
+
+
include: An include declaration specifies a DDL file to be
+referenced when generating code for types in the current DDL file. Record types
+in the current compilation unit may refer to types in all included files.
+File inclusion is recursive. An include does not trigger code
+generation for the referenced file.
+
+
module: Every Hadoop DDL file must have a single module
+declaration that follows the list of includes and precedes all record
+declarations. A module declaration identifies a scope within which
+the names of all types in the current file are visible. Module names are
+mapped to C++ namespaces, Java packages etc. in generated code.
+
+
class: Records types are specified through class
+declarations. A class declaration is like a Java class declaration.
+It specifies a named record type and a list of fields that constitute records
+of the type. Usage is illustrated in the following examples.
+
+
+
+
Examples
+
+
+
A simple DDL file links.jr with just one record declaration.
+
+module links {
+ class Link {
+ ustring URL;
+ boolean isRelative;
+ ustring anchorText;
+ };
+}
+
+
+The Hadoop translator is written in Java. Invocation is done by executing a
+wrapper shell script named named rcc. It takes a list of
+record description files as a mandatory argument and an
+optional language argument (the default is Java) --language or
+-l. Thus a typical invocation would look like:
+
+$ rcc -l C++ ...
+
+
+
+
Target Language Mappings and Support
+
+For all target languages, the unit of code generation is a record type.
+For each record type, Hadoop generates code for serialization and
+deserialization, record comparison and access to record members.
+
+
C++
+
+Support for including Hadoop generated C++ code in applications comes in the
+form of a header file recordio.hh which needs to be included in source
+that uses Hadoop types and a library librecordio.a which applications need
+to be linked with. The header declares the Hadoop C++ namespace which defines
+appropriate types for the various primitives, the basic interfaces for
+records and streams and enumerates the supported serialization encodings.
+Declarations of these interfaces and a description of their semantics follow:
+
+
RecFormat: An enumeration of the serialization encodings supported
+by this implementation of Hadoop.
+
+
InStream: A simple abstraction for an input stream. This has a
+single public read method that reads n bytes from the stream into
+the buffer buf. Has the same semantics as a blocking read system
+call. Returns the number of bytes read or -1 if an error occurs.
+
+
OutStream: A simple abstraction for an output stream. This has a
+single write method that writes n bytes to the stream from the
+buffer buf. Has the same semantics as a blocking write system
+call. Returns the number of bytes written or -1 if an error occurs.
+
+
RecordReader: A RecordReader reads records one at a time from
+an underlying stream in a specified record format. The reader is instantiated
+with a stream and a serialization format. It has a read method that
+takes an instance of a record and deserializes the record from the stream.
+
+
RecordWriter: A RecordWriter writes records one at a
+time to an underlying stream in a specified record format. The writer is
+instantiated with a stream and a serialization format. It has a
+write method that takes an instance of a record and serializes the
+record to the stream.
+
+
Record: The base class for all generated record types. This has two
+public methods type and signature that return the typename and the
+type signature of the record.
+
+
+
+Two files are generated for each record file (note: not for each record). If a
+record file is named "name.jr", the generated files are
+"name.jr.cc" and "name.jr.hh" containing serialization
+implementations and record type declarations respectively.
+
+For each record in the DDL file, the generated header file will contain a
+class definition corresponding to the record type, method definitions for the
+generated type will be present in the '.cc' file. The generated class will
+inherit from the abstract class hadoop::Record. The DDL files
+module declaration determines the namespace the record belongs to.
+Each '.' delimited token in the module declaration results in the
+creation of a namespace. For instance, the declaration module docs.links
+results in the creation of a docs namespace and a nested
+docs::links namespace. In the preceding examples, the Link class
+is placed in the links namespace. The header file corresponding to
+the links.jr file will contain:
+
+
+namespace links {
+ class Link : public hadoop::Record {
+ // ....
+ };
+};
+
+
+Each field within the record will cause the generation of a private member
+declaration of the appropriate type in the class declaration, and one or more
+acccessor methods. The generated class will implement the serialize and
+deserialize methods defined in hadoop::Record+. It will also
+implement the inspection methods type and signature from
+hadoop::Record. A default constructor and virtual destructor will also
+be generated. Serialization code will read/write records into streams that
+implement the hadoop::InStream and the hadoop::OutStream interfaces.
+
+For each member of a record an accessor method is generated that returns
+either the member or a reference to the member. For members that are returned
+by value, a setter method is also generated. This is true for primitive
+data members of the types byte, int, long, boolean, float and
+double. For example, for a int field called MyField the folowing
+code is generated.
+
+
+
+For a ustring or buffer or composite field. The generated code
+only contains accessors that return a reference to the field. A const
+and a non-const accessor are generated. For example:
+
+
+
+Code generation for Java is similar to that for C++. A Java class is generated
+for each record type with private members corresponding to the fields. Getters
+and setters for fields are also generated. Some differences arise in the
+way comparison is expressed and in the mapping of modules to packages and
+classes to files. For equality testing, an equals method is generated
+for each record type. As per Java requirements a hashCode method is also
+generated. For comparison a compareTo method is generated for each
+record type. This has the semantics as defined by the Java Comparable
+interface, that is, the method returns a negative integer, zero, or a positive
+integer as the invoked object is less than, equal to, or greater than the
+comparison parameter.
+
+A .java file is generated per record type as opposed to per DDL
+file as in C++. The module declaration translates to a Java
+package declaration. The module name maps to an identical Java package
+name. In addition to this mapping, the DDL compiler creates the appropriate
+directory hierarchy for the package and places the generated .java
+files in the correct directories.
+
+
Mapping Summary
+
+
+DDL Type C++ Type Java Type
+
+boolean bool boolean
+byte int8_t byte
+int int32_t int
+long int64_t long
+float float float
+double double double
+ustring std::string java.lang.String
+buffer std::string org.apache.hadoop.record.Buffer
+class type class type class type
+vector std::vector java.util.ArrayList
+map std::map java.util.TreeMap
+
+
+
Data encodings
+
+This section describes the format of the data encodings supported by Hadoop.
+Currently, three data encodings are supported, namely binary, CSV and XML.
+
+
Binary Serialization Format
+
+The binary data encoding format is fairly dense. Serialization of composite
+types is simply defined as a concatenation of serializations of the constituent
+elements (lengths are included in vectors and maps).
+
+Composite types are serialized as follows:
+
+
class: Sequence of serialized members.
+
vector: The number of elements serialized as an int. Followed by a
+sequence of serialized elements.
+
map: The number of key value pairs serialized as an int. Followed
+by a sequence of serialized (key,value) pairs.
+
+
+Serialization of primitives is more interesting, with a zero compression
+optimization for integral types and normalization to UTF-8 for strings.
+Primitive types are serialized as follows:
+
+
+
byte: Represented by 1 byte, as is.
+
boolean: Represented by 1-byte (0 or 1)
+
int/long: Integers and longs are serialized zero compressed.
+Represented as 1-byte if -120 <= value < 128. Otherwise, serialized as a
+sequence of 2-5 bytes for ints, 2-9 bytes for longs. The first byte represents
+the number of trailing bytes, N, as the negative number (-120-N). For example,
+the number 1024 (0x400) is represented by the byte sequence 'x86 x04 x00'.
+This doesn't help much for 4-byte integers but does a reasonably good job with
+longs without bit twiddling.
+
float/double: Serialized in IEEE 754 single and double precision
+format in network byte order. This is the format used by Java.
+
ustring: Serialized as 4-byte zero compressed length followed by
+data encoded as UTF-8. Strings are normalized to UTF-8 regardless of native
+language representation.
+
buffer: Serialized as a 4-byte zero compressed length followed by the
+raw bytes in the buffer.
+
+
+
+
CSV Serialization Format
+
+The CSV serialization format has a lot more structure than the "standard"
+Excel CSV format, but we believe the additional structure is useful because
+
+
+
it makes parsing a lot easier without detracting too much from legibility
+
the delimiters around composites make it obvious when one is reading a
+sequence of Hadoop records
+
+
+Serialization formats for the various types are detailed in the grammar that
+follows. The notable feature of the formats is the use of delimiters for
+indicating the certain field types.
+
+
+
A string field begins with a single quote (').
+
A buffer field begins with a sharp (#).
+
A class, vector or map begins with 's{', 'v{' or 'm{' respectively and
+ends with '}'.
+
+
+The CSV format can be described by the following grammar:
+
+
+
+The XML serialization format is the same used by Apache XML-RPC
+(http://ws.apache.org/xmlrpc/types.html). This is an extension of the original
+XML-RPC format and adds some additional data types. All record I/O types are
+not directly expressible in this format, and access to a DDL is required in
+order to convert these to valid types. All types primitive or composite are
+represented by <value> elements. The particular XML-RPC type is
+indicated by a nested element in the <value> element. The encoding for
+records is always UTF-8. Primitive types are serialized as follows:
+
+
+
byte: XML tag <ex:i1>. Values: 1-byte unsigned
+integers represented in US-ASCII
+
boolean: XML tag <boolean>. Values: "0" or "1"
+
int: XML tags <i4> or <int>. Values: 4-byte
+signed integers represented in US-ASCII.
+
long: XML tag <ex:i8>. Values: 8-byte signed integers
+represented in US-ASCII.
+
float: XML tag <ex:float>. Values: Single precision
+floating point numbers represented in US-ASCII.
+
double: XML tag <double>. Values: Double precision
+floating point numbers represented in US-ASCII.
+
ustring: XML tag <;string>. Values: String values
+represented as UTF-8. XML does not permit all Unicode characters in literal
+data. In particular, NULLs and control chars are not allowed. Additionally,
+XML processors are required to replace carriage returns with line feeds and to
+replace CRLF sequences with line feeds. Programming languages that we work
+with do not impose these restrictions on string types. To work around these
+restrictions, disallowed characters and CRs are percent escaped in strings.
+The '%' character is also percent escaped.
+
buffer: XML tag <string&>. Values: Arbitrary binary
+data. Represented as hexBinary, each byte is replaced by its 2-byte
+hexadecimal representation.
+
+
+Composite types are serialized as follows:
+
+
+
class: XML tag <struct>. A struct is a sequence of
+<member> elements. Each <member> element has a <name>
+element and a <value> element. The <name> is a string that must
+match /[a-zA-Z][a-zA-Z0-9_]*/. The value of the member is represented
+by a <value> element.
+
+
vector: XML tag <array<. An <array> contains a
+single <data> element. The <data> element is a sequence of
+<value> elements each of which represents an element of the vector.
+
+
map: XML tag <array>. Same as vector.
+
+
+
+For example:
+
+
+class {
+ int MY_INT; // value 5
+ vector MY_VEC; // values 0.1, -0.89, 2.45e4
+ buffer MY_BUF; // value '\00\n\tabc%'
+}
+
The task requires the file or the nested fileset element to be
+ specified. Optional attributes are language (set the output
+ language, default is "java"),
+ destdir (name of the destination directory for generated java/c++
+ code, default is ".") and failonerror (specifies error handling
+ behavior. default is true).
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ GenericOptionsParser to parse only the generic Hadoop
+ arguments.
+
+ The array of string arguments other than the generic arguments can be
+ obtained by {@link #getRemainingArgs()}.
+
+ @param conf the Configuration to modify.
+ @param args command-line arguments.]]>
+
+
+
+
+ GenericOptionsParser to parse given options as well
+ as generic Hadoop options.
+
+ The resulting CommandLine object can be obtained by
+ {@link #getCommandLine()}.
+
+ @param conf the configuration to modify
+ @param options options built by the caller
+ @param args User-specified arguments]]>
+
+
+
+
+ Strings containing the un-parsed arguments.]]>
+
+
+
+
+ CommandLine object
+ to process the parsed arguments.
+
+ Note: If the object is created with
+ {@link #GenericOptionsParser(Configuration, String[])}, then returned
+ object will only contain parsed generic options.
+
+ @return CommandLine representing list of arguments
+ parsed against Options descriptor.]]>
+
+
+
+
+
+
+
+
+
+ GenericOptionsParser is a utility to parse command line
+ arguments generic to the Hadoop framework.
+
+ GenericOptionsParser recognizes several standarad command
+ line arguments, enabling applications to easily specify a namenode, a
+ jobtracker, additional configuration resources etc.
+
+
Generic Options
+
+
The supported generic options are:
+
+ -conf <configuration file> specify a configuration file
+ -D <property=value> use value for given property
+ -fs <local|namenode:port> specify a namenode
+ -jt <local|jobtracker:port> specify a job tracker
+
Generic command line arguments might modify
+ Configuration objects, given to constructors.
+
+
The functionality is implemented using Commons CLI.
+
+
Examples:
+
+ $ bin/hadoop dfs -fs darwin:8020 -ls /data
+ list /data directory in dfs with namenode darwin:8020
+
+ $ bin/hadoop dfs -D fs.default.name=darwin:8020 -ls /data
+ list /data directory in dfs with namenode darwin:8020
+
+ $ bin/hadoop dfs -conf hadoop-site.xml -ls /data
+ list /data directory in dfs with conf specified in hadoop-site.xml
+
+ $ bin/hadoop job -D mapred.job.tracker=darwin:50020 -submit job.xml
+ submit a job to job tracker darwin:50020
+
+ $ bin/hadoop job -jt darwin:50020 -submit job.xml
+ submit a job to job tracker darwin:50020
+
+ $ bin/hadoop job -jt local -submit job.xml
+ submit a job to local runner
+
+
+ @see Tool
+ @see ToolRunner]]>
+
+
+
+
+
+
+
+
+
+
+ Class<T>) of the
+ argument of type T.
+ @param The type of the argument
+ @param t the object to get it class
+ @return Class<T>]]>
+
+
+
+
+
+
+ List<T> to a an array of
+ T[].
+ @param c the Class object of the items in the list
+ @param list the list to convert]]>
+
+
+
+
+
+ List<T> to a an array of
+ T[].
+ @param list the list to convert
+ @throws ArrayIndexOutOfBoundsException if the list is empty.
+ Use {@link #toArray(Class, List)} if the list may be empty.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if native-hadoop is loaded,
+ else false]]>
+
+
+
+
+
+ true if native hadoop libraries, if present, can be
+ used for this job; false otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ { pq.top().change(); pq.adjustTop(); }
+ instead of
+ { o = pq.pop(); o.change(); pq.push(o); }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Clients and/or applications can use the provided Progressable
+ to explicitly report progress to the Hadoop framework. This is especially
+ important for operations which take an insignificant amount of time since,
+ in-lieu of the reported progress, the framework has to assume that an error
+ has occured and time-out the operation.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Hadoop Pipes
+ or Hadoop Streaming.
+
+ It also checks to ensure that we are running on a *nix platform else
+ (e.g. in Cygwin/Windows) it returns null.
+ @param job job configuration
+ @return a String[] with the ulimit command arguments or
+ null if we are running on a non *nix platform or
+ if the limit is unspecified.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Shell interface.
+ @param cmd shell command to execute.
+ @return the output of the executed command.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Shell can be used to run unix commands like du or
+ df. It also offers facilities to gate commands by
+ time-intervals.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ShellCommandExecutorshould be used in cases where the output
+ of the command needs no explicit parsing and where the command, working
+ directory and the environment remains unchanged. The output of the command
+ is stored as-is and is expected to be small.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ charToEscape in the string
+ with the escape char escapeChar
+
+ @param str string
+ @param escapeChar escape char
+ @param charToEscape the char to be escaped
+ @return an escaped string]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ charToEscape in the string
+ with the escape char escapeChar
+
+ @param str string
+ @param escapeChar escape char
+ @param charToEscape the escaped char
+ @return an unescaped string]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Tool, is the standard for any Map-Reduce tool/application.
+ The tool/application should delegate the handling of
+
+ standard command-line options to {@link ToolRunner#run(Tool, String[])}
+ and only handle its custom arguments.
+
+
Here is how a typical Tool is implemented:
+
+ public class MyApp extends Configured implements Tool {
+
+ public int run(String[] args) throws Exception {
+ // Configuration processed by ToolRunner
+ Configuration conf = getConf();
+
+ // Create a JobConf using the processed conf
+ JobConf job = new JobConf(conf, MyApp.class);
+
+ // Process custom command-line options
+ Path in = new Path(args[1]);
+ Path out = new Path(args[2]);
+
+ // Specify various job-specific parameters
+ job.setJobName("my-app");
+ job.setInputPath(in);
+ job.setOutputPath(out);
+ job.setMapperClass(MyApp.MyMapper.class);
+ job.setReducerClass(MyApp.MyReducer.class);
+
+ // Submit the job, then poll for progress until the job is complete
+ JobClient.runJob(job);
+ }
+
+ public static void main(String[] args) throws Exception {
+ // Let ToolRunner handle generic command-line options
+ int res = ToolRunner.run(new Configuration(), new Sort(), args);
+
+ System.exit(res);
+ }
+ }
+
+
+ @see GenericOptionsParser
+ @see ToolRunner]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Tool by {@link Tool#run(String[])}, after
+ parsing with the given generic arguments. Uses the given
+ Configuration, or builds one if null.
+
+ Sets the Tool's configuration with the possibly modified
+ version of the conf.
+
+ @param conf Configuration for the Tool.
+ @param tool Tool to run.
+ @param args command-line arguments to the tool.
+ @return exit code of the {@link Tool#run(String[])} method.]]>
+
+
+
+
+
+
+
+ Tool with its Configuration.
+
+ Equivalent to run(tool.getConf(), tool, args).
+
+ @param tool Tool to run.
+ @param args command-line arguments to the tool.
+ @return exit code of the {@link Tool#run(String[])} method.]]>
+
+
+
+
+
+
+
+
+
+ ToolRunner can be used to run classes implementing
+ Tool interface. It works in conjunction with
+ {@link GenericOptionsParser} to parse the
+
+ generic hadoop command line arguments and modifies the
+ Configuration of the Tool. The
+ application-specific options are passed along without being modified.
+
+
+ @see Tool
+ @see GenericOptionsParser]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/x86_64/share/hadoop/common/jdiff/hadoop_0.18.1.xml b/x86_64/share/hadoop/common/jdiff/hadoop_0.18.1.xml
new file mode 100644
index 0000000..fd844cb
--- /dev/null
+++ b/x86_64/share/hadoop/common/jdiff/hadoop_0.18.1.xml
@@ -0,0 +1,44778 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ final.
+
+ @param name resource to be added, the classpath is examined for a file
+ with that name.]]>
+
+
+
+
+
+ final.
+
+ @param url url of the resource to be added, the local filesystem is
+ examined directly to find the resource, without referring to
+ the classpath.]]>
+
+
+
+
+
+ final.
+
+ @param file file-path of resource to be added, the local filesystem is
+ examined directly to find the resource, without referring to
+ the classpath.]]>
+
+
+
+
+
+ name property, null if
+ no such property exists.
+
+ Values are processed for variable expansion
+ before being returned.
+
+ @param name the property name.
+ @return the value of the name property,
+ or null if no such property exists.]]>
+
+
+
+
+
+ name property, without doing
+ variable expansion.
+
+ @param name the property name.
+ @return the value of the name property,
+ or null if no such property exists.]]>
+
+
+
+
+
+
+ value of the name property.
+
+ @param name property name.
+ @param value property value.]]>
+
+
+
+
+
+
+ name property. If no such property
+ exists, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value, or defaultValue if the property
+ doesn't exist.]]>
+
+
+
+
+
+
+ name property as an int.
+
+ If no such property exists, or if the specified value is not a valid
+ int, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as an int,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to an int.
+
+ @param name property name.
+ @param value int value of the property.]]>
+
+
+
+
+
+
+ name property as a long.
+ If no such property is specified, or if the specified value is not a valid
+ long, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as a long,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to a long.
+
+ @param name property name.
+ @param value long value of the property.]]>
+
+
+
+
+
+
+ name property as a float.
+ If no such property is specified, or if the specified value is not a valid
+ float, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as a float,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property as a boolean.
+ If no such property is specified, or if the specified value is not a valid
+ boolean, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as a boolean,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to a boolean.
+
+ @param name property name.
+ @param value boolean value of the property.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ name property as
+ a collection of Strings.
+ If no such property is specified then empty collection is returned.
+
+ This is an optimized version of {@link #getStrings(String)}
+
+ @param name property name.
+ @return property value as a collection of Strings.]]>
+
+
+
+
+
+ name property as
+ an array of Strings.
+ If no such property is specified then null is returned.
+
+ @param name property name.
+ @return property value as an array of Strings,
+ or null.]]>
+
+
+
+
+
+
+ name property as
+ an array of Strings.
+ If no such property is specified then default value is returned.
+
+ @param name property name.
+ @param defaultValue The default value
+ @return property value as an array of Strings,
+ or default value.]]>
+
+
+
+
+
+
+ name property as
+ as comma delimited values.
+
+ @param name property name.
+ @param values The values]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ name property as a Class.
+ If no such property is specified, then defaultValue is
+ returned.
+
+ @param name the class name.
+ @param defaultValue default value.
+ @return property value as a Class,
+ or defaultValue.]]>
+
+
+
+
+
+
+
+ name property as a Class
+ implementing the interface specified by xface.
+
+ If no such property is specified, then defaultValue is
+ returned.
+
+ An exception is thrown if the returned class does not implement the named
+ interface.
+
+ @param name the class name.
+ @param defaultValue default value.
+ @param xface the interface implemented by the named class.
+ @return property value as a Class,
+ or defaultValue.]]>
+
+
+
+
+
+
+
+ name property to the name of a
+ theClass implementing the given interface xface.
+
+ An exception is thrown if theClass does not implement the
+ interface xface.
+
+ @param name property name.
+ @param theClass property value.
+ @param xface the interface implemented by the named class.]]>
+
+
+
+
+
+
+
+ dirsProp with
+ the given path. If dirsProp contains multiple directories,
+ then one is chosen based on path's hash code. If the selected
+ directory does not exist, an attempt is made to create it.
+
+ @param dirsProp directory in which to locate the file.
+ @param path file-path.
+ @return local file under the directory with the given path.]]>
+
+
+
+
+
+
+
+ dirsProp with
+ the given path. If dirsProp contains multiple directories,
+ then one is chosen based on path's hash code. If the selected
+ directory does not exist, an attempt is made to create it.
+
+ @param dirsProp directory in which to locate the file.
+ @param path file-path.
+ @return local file under the directory with the given path.]]>
+
+
+
+
+
+
+
+
+
+
+
+ name.
+
+ @param name configuration resource name.
+ @return an input stream attached to the resource.]]>
+
+
+
+
+
+ name.
+
+ @param name configuration resource name.
+ @return a reader attached to the resource.]]>
+
+
+
+
+ String
+ key-value pairs in the configuration.
+
+ @return an iterator over the entries.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true to set quiet-mode on, false
+ to turn it off.]]>
+
+
+
+
+
+
+
+
+
+
+ Resources
+
+
Configurations are specified by resources. A resource contains a set of
+ name/value pairs as XML data. Each resource is named by either a
+ String or by a {@link Path}. If named by a String,
+ then the classpath is examined for a file with that name. If named by a
+ Path, then the local filesystem is examined directly, without
+ referring to the classpath.
+
+
Hadoop by default specifies two resources, loaded in-order from the
+ classpath:
hadoop-site.xml: Site-specific configuration for a given hadoop
+ installation.
+
+ Applications may add additional resources, which are loaded
+ subsequent to these resources in the order they are added.
+
+
Final Parameters
+
+
Configuration parameters may be declared final.
+ Once a resource declares a value final, no subsequently-loaded
+ resource can alter that value.
+ For example, one might define a final parameter with:
+
+
+ When conf.get("tempdir") is called, then ${basedir}
+ will be resolved to another property in this Configuration, while
+ ${user.name} would then ordinarily be resolved to the value
+ of the System property with that name.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The balancer is a tool that balances disk space usage on an HDFS cluster
+ when some datanodes become full or when new empty nodes join the cluster.
+ The tool is deployed as an application program that can be run by the
+ cluster administrator on a live HDFS cluster while applications
+ adding and deleting files.
+
+
SYNOPSIS
+
+ To start:
+ bin/start-balancer.sh [-threshold ]
+ Example: bin/ start-balancer.sh
+ start the balancer with a default threshold of 10%
+ bin/ start-balancer.sh -threshold 5
+ start the balancer with a threshold of 5%
+ To stop:
+ bin/ stop-balancer.sh
+
+
+
DESCRIPTION
+
The threshold parameter is a fraction in the range of (0%, 100%) with a
+ default value of 10%. The threshold sets a target for whether the cluster
+ is balanced. A cluster is balanced if for each datanode, the utilization
+ of the node (ratio of used space at the node to total capacity of the node)
+ differs from the utilization of the (ratio of used space in the cluster
+ to total capacity of the cluster) by no more than the threshold value.
+ The smaller the threshold, the more balanced a cluster will become.
+ It takes more time to run the balancer for small threshold values.
+ Also for a very small threshold the cluster may not be able to reach the
+ balanced state when applications write and delete files concurrently.
+
+
The tool moves blocks from highly utilized datanodes to poorly
+ utilized datanodes iteratively. In each iteration a datanode moves or
+ receives no more than the lesser of 10G bytes or the threshold fraction
+ of its capacity. Each iteration runs no more than 20 minutes.
+ At the end of each iteration, the balancer obtains updated datanodes
+ information from the namenode.
+
+
A system property that limits the balancer's use of bandwidth is
+ defined in the default configuration file:
+
+
+ dfs.balance.bandwidthPerSec
+ 1048576
+ Specifies the maximum bandwidth that each datanode
+ can utilize for the balancing purpose in term of the number of bytes
+ per second.
+
+
+
+
This property determines the maximum speed at which a block will be
+ moved from one datanode to another. The default value is 1MB/s. The higher
+ the bandwidth, the faster a cluster can reach the balanced state,
+ but with greater competition with application processes. If an
+ administrator changes the value of this property in the configuration
+ file, the change is observed when HDFS is next restarted.
+
+
MONITERING BALANCER PROGRESS
+
After the balancer is started, an output file name where the balancer
+ progress will be recorded is printed on the screen. The administrator
+ can monitor the running of the balancer by reading the output file.
+ The output shows the balancer's status iteration by iteration. In each
+ iteration it prints the starting time, the iteration number, the total
+ number of bytes that have been moved in the previous iterations,
+ the total number of bytes that are left to move in order for the cluster
+ to be balanced, and the number of bytes that are being moved in this
+ iteration. Normally "Bytes Already Moved" is increasing while "Bytes Left
+ To Move" is decreasing.
+
+
Running multiple instances of the balancer in an HDFS cluster is
+ prohibited by the tool.
+
+
The balancer automatically exits when any of the following five
+ conditions is satisfied:
+
+
The cluster is balanced;
+
No block can be moved;
+
No block has been moved for five consecutive iterations;
+
An IOException occurs while communicating with the namenode;
+
Another balancer is running.
+
+
+
Upon exit, a balancer returns an exit code and prints one of the
+ following messages to the output file in corresponding to the above exit
+ reasons:
+
+
The cluster is balanced. Exiting
+
No block can be moved. Exiting...
+
No block has been moved for 3 iterations. Exiting...
+
Received an IO exception: failure reason. Exiting...
+
Another balancer is running. Exiting...
+
+
+
The administrator can interrupt the execution of the balancer at any
+ time by running the command "stop-balancer.sh" on the machine where the
+ balancer is running.]]>
+
files with blocks that are completely missing from all datanodes.
+ In this case the tool can perform one of the following actions:
+
+
none ({@link NamenodeFsck#FIXING_NONE})
+
move corrupted files to /lost+found directory on DFS
+ ({@link NamenodeFsck#FIXING_MOVE}). Remaining data blocks are saved as a
+ block chains, representing longest consecutive series of valid blocks.
{@link FSConstants.StartupOption#REGULAR REGULAR} - normal startup
+
{@link FSConstants.StartupOption#FORMAT FORMAT} - format name node
+
{@link FSConstants.StartupOption#UPGRADE UPGRADE} - start the cluster
+ upgrade and create a snapshot of the current file system state
+
{@link FSConstants.StartupOption#ROLLBACK ROLLBACK} - roll the
+ cluster back to the previous state
+
+ The option is passed via configuration field:
+ dfs.namenode.startup
+
+ The conf will be modified to reflect the actual ports on which
+ the NameNode is up and running if the user passes the port as
+ zero in the conf.
+
+ @param conf confirguration
+ @throws IOException]]>
+
+
+
+
+
+ zero.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ datanode whose
+ total size is size
+
+ @param datanode on which blocks are located
+ @param size total size of blocks]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ blocksequence (namespace)
+ 2) block->machinelist ("inodes")
+
+ The first table is stored on disk and is very precious.
+ The second table is rebuilt every time the NameNode comes
+ up.
+
+ 'NameNode' refers to both this class as well as the 'NameNode server'.
+ The 'FSNamesystem' class actually performs most of the filesystem
+ management. The majority of the 'NameNode' class itself is concerned
+ with exposing the IPC interface to the outside world, plus some
+ configuration management.
+
+ NameNode implements the ClientProtocol interface, which allows
+ clients to ask for DFS services. ClientProtocol is not
+ designed for direct use by authors of DFS client code. End-users
+ should instead use the org.apache.nutch.hadoop.fs.FileSystem class.
+
+ NameNode also implements the DatanodeProtocol interface, used by
+ DataNode programs that actually store DFS data blocks. These
+ methods are invoked repeatedly and automatically by all the
+ DataNodes in a DFS deployment.
+
+ NameNode also implements the NamenodeProtocol interface, used by
+ secondary namenodes or rebalancing processes to get partial namenode's
+ state, for example partial blocksMap etc.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The tool scans all files and directories, starting from an indicated
+ root path. The following abnormal conditions are detected and handled:
+
+
files with blocks that are completely missing from all datanodes.
+ In this case the tool can perform one of the following actions:
+
+
none ({@link #FIXING_NONE})
+
move corrupted files to /lost+found directory on DFS
+ ({@link #FIXING_MOVE}). Remaining data blocks are saved as a
+ block chains, representing longest consecutive series of valid blocks.
+
delete corrupted files ({@link #FIXING_DELETE})
+
+
+
detect files with under-replicated or over-replicated blocks
Applications specify the files, via urls (hdfs:// or http://) to be cached
+ via the {@link JobConf}. The DistributedCache assumes that the
+ files specified via hdfs:// urls are already present on the
+ {@link FileSystem} at the path specified by the url.
+
+
The framework will copy the necessary files on to the slave node before
+ any tasks for the job are executed on that node. Its efficiency stems from
+ the fact that the files are only copied once per job and the ability to
+ cache archives which are un-archived on the slaves.
+
+
DistributedCache can be used to distribute simple, read-only
+ data/text files and/or more complex types such as archives, jars etc.
+ Archives (zip, tar and tgz/tar.gz files) are un-archived at the slave nodes.
+ Jars may be optionally added to the classpath of the tasks, a rudimentary
+ software distribution mechanism. Files have execution permissions.
+ Optionally users can also direct it to symlink the distributed cache file(s)
+ into the working directory of the task.
+
+
DistributedCache tracks modification timestamps of the cache
+ files. Clearly the cache files should not be modified by the application
+ or externally while the job is executing.
+
+
Here is an illustrative example on how to use the
+ DistributedCache:
+
+ // Setting up the cache for the application
+
+ 1. Copy the requisite files to the FileSystem:
+
+ $ bin/hadoop fs -copyFromLocal lookup.dat /myapp/lookup.dat
+ $ bin/hadoop fs -copyFromLocal map.zip /myapp/map.zip
+ $ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar
+ $ bin/hadoop fs -copyFromLocal mytar.tar /myapp/mytar.tar
+ $ bin/hadoop fs -copyFromLocal mytgz.tgz /myapp/mytgz.tgz
+ $ bin/hadoop fs -copyFromLocal mytargz.tar.gz /myapp/mytargz.tar.gz
+
+ 2. Setup the application's JobConf:
+
+ JobConf job = new JobConf();
+ DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"),
+ job);
+ DistributedCache.addCacheArchive(new URI("/myapp/map.zip", job);
+ DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
+ DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar", job);
+ DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz", job);
+ DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz", job);
+
+ 3. Use the cached files in the {@link Mapper} or {@link Reducer}:
+
+ public static class MapClass extends MapReduceBase
+ implements Mapper<K, V, K, V> {
+
+ private Path[] localArchives;
+ private Path[] localFiles;
+
+ public void configure(JobConf job) {
+ // Get the cached archives/files
+ localArchives = DistributedCache.getLocalCacheArchives(job);
+ localFiles = DistributedCache.getLocalCacheFiles(job);
+ }
+
+ public void map(K key, V value,
+ OutputCollector<K, V> output, Reporter reporter)
+ throws IOException {
+ // Use data from the cached archives/files here
+ // ...
+ // ...
+ output.collect(k, v);
+ }
+ }
+
+
+ A filename pattern is composed of regular characters and
+ special pattern matching characters, which are:
+
+
+
+
+
+
?
+
Matches any single character.
+
+
+
*
+
Matches zero or more characters.
+
+
+
[abc]
+
Matches a single character from character set
+ {a,b,c}.
+
+
+
[a-b]
+
Matches a single character from the character range
+ {a...b}. Note that character a must be
+ lexicographically less than or equal to character b.
+
+
+
[^a]
+
Matches a single character that is not from character set or range
+ {a}. Note that the ^ character must occur
+ immediately to the right of the opening bracket.
+
+
+
\c
+
Removes (escapes) any special meaning of character c.
+
+
+
{ab,cd}
+
Matches a string from the string set {ab, cd}
+
+
+
{ab,c{de,fh}}
+
Matches a string from the string set {ab, cde, cfh}
+
+
+
+
+
+ @param pathPattern a regular expression specifying a pth pattern
+
+ @return an array of paths that match the path pattern
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ All user code that may potentially use the Hadoop Distributed
+ File System should be written to use a FileSystem object. The
+ Hadoop DFS is a multi-machine system that appears as a single
+ disk. It's useful because of its fault tolerance and potentially
+ very large capacity.
+
+
+ The local implementation is {@link LocalFileSystem} and distributed
+ implementation is {@link DistributedFileSystem}.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FilterFileSystem contains
+ some other file system, which it uses as
+ its basic file system, possibly transforming
+ the data along the way or providing additional
+ functionality. The class FilterFileSystem
+ itself simply overrides all methods of
+ FileSystem with versions that
+ pass all requests to the contained file
+ system. Subclasses of FilterFileSystem
+ may further override some of these methods
+ and may also provide additional methods
+ and fields.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ buf at offset
+ and checksum into checksum.
+ The method is used for implementing read, therefore, it should be optimized
+ for sequential reading
+ @param pos chunkPos
+ @param buf desitination buffer
+ @param offset offset in buf at which to store data
+ @param len maximun number of bytes to read
+ @return number of bytes read]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ -1 if the end of the
+ stream is reached.
+ @exception IOException if an I/O error occurs.]]>
+
+
+
+
+
+
+
+
+ This method implements the general contract of the corresponding
+ {@link InputStream#read(byte[], int, int) read} method of
+ the {@link InputStream} class. As an additional
+ convenience, it attempts to read as many bytes as possible by repeatedly
+ invoking the read method of the underlying stream. This
+ iterated read continues until one of the following
+ conditions becomes true:
+
+
The specified number of bytes have been read,
+
+
The read method of the underlying stream returns
+ -1, indicating end-of-file.
+
+
If the first read on the underlying stream returns
+ -1 to indicate end-of-file then this method returns
+ -1. Otherwise this method returns the number of bytes
+ actually read.
+
+ @param b destination buffer.
+ @param off offset at which to start storing bytes.
+ @param len maximum number of bytes to read.
+ @return the number of bytes read, or -1 if the end of
+ the stream has been reached.
+ @exception IOException if an I/O error occurs.
+ ChecksumException if any checksum error occurs]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ n bytes of data from the
+ input stream.
+
+
This method may skip more bytes than are remaining in the backing
+ file. This produces no exception and the number of bytes skipped
+ may include some number of bytes that were beyond the EOF of the
+ backing file. Attempting to read from the stream after skipping past
+ the end will result in -1 indicating the end of the file.
+
+
If n is negative, no bytes are skipped.
+
+ @param n the number of bytes to be skipped.
+ @return the actual number of bytes skipped.
+ @exception IOException if an I/O error occurs.
+ ChecksumException if the chunk to skip to is corrupted]]>
+
+
+
+
+
+
+ This method may seek past the end of the file.
+ This produces no exception and an attempt to read from
+ the stream will result in -1 indicating the end of the file.
+
+ @param pos the postion to seek to.
+ @exception IOException if an I/O error occurs.
+ ChecksumException if the chunk to seek to is corrupted]]>
+
+
+
+
+
+
+
+
+
+ len bytes from
+ stm
+
+ @param stm an input stream
+ @param buf destiniation buffer
+ @param offset offset at which to store data
+ @param len number of bytes to read
+ @return actual number of bytes read
+ @throws IOException if there is any IO error]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ len bytes from the specified byte array
+ starting at offset off and generate a checksum for
+ each data chunk.
+
+
This method stores bytes from the given array into this
+ stream's buffer before it gets checksumed. The buffer gets checksumed
+ and flushed to the underlying output stream when all data
+ in a checksum chunk are in the buffer. If the buffer is empty and
+ requested length is at least as large as the size of next checksum chunk
+ size, this method will checksum and write the chunk directly
+ to the underlying output stream. Thus it avoids uneccessary data copy.
+
+ @param b the data.
+ @param off the start offset in the data.
+ @param len the number of bytes to write.
+ @exception IOException if an I/O error occurs.]]>
+
+
+ DataInputBuffer buffer = new DataInputBuffer();
+ while (... loop condition ...) {
+ byte[] data = ... get data ...;
+ int dataLength = ... get data length ...;
+ buffer.reset(data, dataLength);
+ ... read buffer using DataInput methods ...
+ }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This saves memory over creating a new DataOutputStream and
+ ByteArrayOutputStream each time data is written.
+
+
Typical usage is something like the following:
+
+ DataOutputBuffer buffer = new DataOutputBuffer();
+ while (... loop condition ...) {
+ buffer.reset();
+ ... write buffer using DataOutput methods ...
+ byte[] data = buffer.getData();
+ int dataLength = buffer.getLength();
+ ... write data to its ultimate destination ...
+ }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to store
+ @param item the object to be stored
+ @param keyName the name of the key to use
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param keyName the name of the key to use
+ @param itemClass the class of the item
+ @return restored object
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param items the objects to be stored
+ @param keyName the name of the key to use
+ @throws IndexOutOfBoundsException if the items array is empty
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param keyName the name of the key to use
+ @param itemClass the class of the item
+ @return restored object
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+ DefaultStringifier offers convenience methods to store/load objects to/from
+ the configuration.
+
+ @param the class of the objects to stringify]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a DoubleWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a FloatWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ When two sequence files, which have same Key type but different Value
+ types, are mapped out to reduce, multiple Value types is not allowed.
+ In this case, this class can help you wrap instances with different types.
+
+
+
+ Compared with ObjectWritable, this class is much more effective,
+ because ObjectWritable will append the class declaration as a String
+ into the output file in every Key-Value pair.
+
+
+
+ Generic Writable implements {@link Configurable} interface, so that it will be
+ configured by the framework. The configuration is passed to the wrapped objects
+ implementing {@link Configurable} interface before deserialization.
+
+
+ how to use it:
+ 1. Write your own class, such as GenericObject, which extends GenericWritable.
+ 2. Implements the abstract method getTypes(), defines
+ the classes which will be wrapped in GenericObject in application.
+ Attention: this classes defined in getTypes() method, must
+ implement Writable interface.
+
+
+ @since Nov 8, 2006]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This saves memory over creating a new InputStream and
+ ByteArrayInputStream each time data is read.
+
+
Typical usage is something like the following:
+
+ InputBuffer buffer = new InputBuffer();
+ while (... loop condition ...) {
+ byte[] data = ... get data ...;
+ int dataLength = ... get data length ...;
+ buffer.reset(data, dataLength);
+ ... read buffer using InputStream methods ...
+ }
+
+ @see DataInputBuffer
+ @see DataOutput]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a IntWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ closes the input and output streams
+ at the end.
+ @param in InputStrem to read from
+ @param out OutputStream to write to
+ @param conf the Configuration object]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ignore any {@link IOException} or
+ null pointers. Must only be used for cleanup in exception handlers.
+ @param log the log to record problems to at debug level. Can be null.
+ @param closeables the objects to close]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a LongWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A map is a directory containing two files, the data file,
+ containing all keys and values in the map, and a smaller index
+ file, containing a fraction of the keys. The fraction is determined by
+ {@link Writer#getIndexInterval()}.
+
+
The index file is read entirely into memory. Thus key implementations
+ should try to keep themselves small.
+
+
Map files are created by adding entries in-order. To maintain a large
+ database, perform updates by copying the previous version of a database and
+ merging in a sorted change list, to create a new version of the database in
+ a new file. Sorting large change lists can be done with {@link
+ SequenceFile.Sorter}.]]>
+
SequenceFile provides {@link Writer}, {@link Reader} and
+ {@link Sorter} classes for writing, reading and sorting respectively.
+
+ There are three SequenceFileWriters based on the
+ {@link CompressionType} used to compress key/value pairs:
+
+
+ Writer : Uncompressed records.
+
+
+ RecordCompressWriter : Record-compressed files, only compress
+ values.
+
+
+ BlockCompressWriter : Block-compressed files, both keys &
+ values are collected in 'blocks'
+ separately and compressed. The size of
+ the 'block' is configurable.
+
+
+
The actual compression algorithm used to compress key and/or values can be
+ specified by using the appropriate {@link CompressionCodec}.
+
+
The recommended way is to use the static createWriter methods
+ provided by the SequenceFile to chose the preferred format.
+
+
The {@link Reader} acts as the bridge and can read any of the above
+ SequenceFile formats.
+
+
SequenceFile Formats
+
+
Essentially there are 3 different formats for SequenceFiles
+ depending on the CompressionType specified. All of them share a
+ common header described below.
+
+
SequenceFile Header
+
+
+ version - 3 bytes of magic header SEQ, followed by 1 byte of actual
+ version number (e.g. SEQ4 or SEQ6)
+
+
+ keyClassName -key class
+
+
+ valueClassName - value class
+
+
+ compression - A boolean which specifies if compression is turned on for
+ keys/values in this file.
+
+
+ blockCompression - A boolean which specifies if block-compression is
+ turned on for keys/values in this file.
+
+
+ compression codec - CompressionCodec class which is used for
+ compression of keys and/or values (if compression is
+ enabled).
+
+
+ metadata - {@link Metadata} for this file.
+
+
+ sync - A sync marker to denote end of the header.
+
The compressed blocks of key lengths and value lengths consist of the
+ actual lengths of individual keys/values encoded in ZeroCompressedInteger
+ format.
+
+ @see CompressionCodec]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ key, skipping its
+ value. True if another entry exists, and false at end of file.]]>
+
+
+
+
+
+
+
+ key and
+ val. Returns true if such a pair exists and false when at
+ end of file]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The position passed must be a position returned by {@link
+ SequenceFile.Writer#getLength()} when writing this file. To seek to an arbitrary
+ position, use {@link SequenceFile.Reader#sync(long)}.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ SegmentDescriptor
+ @param segments the list of SegmentDescriptors
+ @param tmpDir the directory to write temporary files into
+ @return RawKeyValueIterator
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ For best performance, applications should make sure that the {@link
+ Writable#readFields(DataInput)} implementation of their keys is
+ very efficient. In particular, it should avoid allocating memory.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This always returns a synchronized position. In other words,
+ immediately after calling {@link SequenceFile.Reader#seek(long)} with a position
+ returned by this method, {@link SequenceFile.Reader#next(Writable)} may be called. However
+ the key may be earlier in the file than key last written when this
+ method was called (e.g., with block-compression, it may be the first key
+ in the block that was being written when this method was called).]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ key. Returns
+ true if such a key exists and false when at the end of the set.]]>
+
+
+
+
+
+
+ key.
+ Returns key, or null if no match exists.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the class of the objects to stringify]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ position. Note that this
+ method avoids using the converter or doing String instatiation
+ @return the Unicode scalar value at position or -1
+ if the position is invalid or points to a
+ trailing byte]]>
+
+
+
+
+
+
+
+
+
+ what in the backing
+ buffer, starting as position start. The starting
+ position is measured in bytes and the return value is in
+ terms of byte position in the buffer. The backing buffer is
+ not converted to a string for this operation.
+ @return byte position of the first occurence of the search
+ string in the UTF-8 buffer or -1 if not found]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a Text with the same contents.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ replace is true, then
+ malformed input is replaced with the
+ substitution character, which is U+FFFD. Otherwise the
+ method throws a MalformedInputException.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ replace is true, then
+ malformed input is replaced with the
+ substitution character, which is U+FFFD. Otherwise the
+ method throws a MalformedInputException.
+ @return ByteBuffer: bytes stores at ByteBuffer.array()
+ and length is ByteBuffer.limit()]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ In
+ addition, it provides methods for string traversal without converting the
+ byte array to a string.
Also includes utilities for
+ serializing/deserialing a string, coding/decoding a string, checking if a
+ byte array contains valid UTF8 code, calculating the length of an encoded
+ string.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a UTF8 with the same contents.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Also includes utilities for efficiently reading and writing UTF-8.
+
+ @deprecated replaced by Text]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This is useful when a class may evolve, so that instances written by the
+ old version of the class may still be processed by the new version. To
+ handle this situation, {@link #readFields(DataInput)}
+ implementations should catch {@link VersionMismatchException}.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a VIntWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a VLongWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ out.
+
+ @param out DataOuput to serialize this object into.
+ @throws IOException]]>
+
+
+
+
+
+
+ in.
+
+
For efficiency, implementations should attempt to re-use storage in the
+ existing object where possible.
+
+ @param in DataInput to deseriablize this object from.
+ @throws IOException]]>
+
+
+
+ Any key or value type in the Hadoop Map-Reduce
+ framework implements this interface.
+
+
Implementations typically implement a static read(DataInput)
+ method which constructs a new instance, calls {@link #readFields(DataInput)}
+ and returns the instance.
+
+
Example:
+
+ public class MyWritable implements Writable {
+ // Some data
+ private int counter;
+ private long timestamp;
+
+ public void write(DataOutput out) throws IOException {
+ out.writeInt(counter);
+ out.writeLong(timestamp);
+ }
+
+ public void readFields(DataInput in) throws IOException {
+ counter = in.readInt();
+ timestamp = in.readLong();
+ }
+
+ public static MyWritable read(DataInput in) throws IOException {
+ MyWritable w = new MyWritable();
+ w.readFields(in);
+ return w;
+ }
+ }
+
]]>
+
+
+
+
+
+
+
+
+ WritableComparables can be compared to each other, typically
+ via Comparators. Any type which is to be used as a
+ key in the Hadoop Map-Reduce framework should implement this
+ interface.
+
+
Example:
+
+ public class MyWritableComparable implements WritableComparable {
+ // Some data
+ private int counter;
+ private long timestamp;
+
+ public void write(DataOutput out) throws IOException {
+ out.writeInt(counter);
+ out.writeLong(timestamp);
+ }
+
+ public void readFields(DataInput in) throws IOException {
+ counter = in.readInt();
+ timestamp = in.readLong();
+ }
+
+ public int compareTo(MyWritableComparable w) {
+ int thisValue = this.value;
+ int thatValue = ((IntWritable)o).value;
+ return (thisValue < thatValue ? -1 : (thisValue==thatValue ? 0 : 1));
+ }
+ }
+
One may optimize compare-intensive operations by overriding
+ {@link #compare(byte[],int,int,byte[],int,int)}. Static utility methods are
+ provided to assist in optimized implementations of this method.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Enum type
+ @param in DataInput to read from
+ @param enumType Class type of Enum
+ @return Enum represented by String read from DataInput
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ len number of bytes in input streamin
+ @param in input stream
+ @param len number of bytes to skip
+ @throws IOException when skipped less number of bytes]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ CompressionCodec for which to get the
+ Compressor
+ @return Compressor for the given
+ CompressionCodec from the pool or a new one]]>
+
+
+
+
+
+ CompressionCodec for which to get the
+ Decompressor
+ @return Decompressor for the given
+ CompressionCodec the pool or a new one]]>
+
+
+
+
+
+ Compressor to be returned to the pool]]>
+
+
+
+
+
+ Decompressor to be returned to the
+ pool]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Implementations are assumed to be buffered. This permits clients to
+ reposition the underlying input stream then call {@link #resetState()},
+ without having to also synchronize client buffers.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true indicating that more input data is required.
+
+ @param b Input data
+ @param off Start offset
+ @param len Length]]>
+
+
+
+
+ true if the input data buffer is empty and
+ #setInput() should be called in order to provide more input.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the end of the compressed
+ data output stream has been reached.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true indicating that more input data is required.
+
+ @param b Input data
+ @param off Start offset
+ @param len Length]]>
+
+
+
+
+ true if the input data buffer is empty and
+ #setInput() should be called in order to provide more input.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ true if a preset dictionary is needed for decompression.
+ @return true if a preset dictionary is needed for decompression]]>
+
+
+
+
+ true if the end of the compressed
+ data output stream has been reached.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if native-lzo library is loaded & initialized;
+ else false]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ lzo compression/decompression pair.
+ http://www.oberhumer.com/opensource/lzo/]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if lzo compressors are loaded & initialized,
+ else false]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if lzo decompressors are loaded & initialized,
+ else false]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ @return the total (non-negative) number of uncompressed bytes input so far]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ @return the total (non-negative) number of uncompressed bytes input so far]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if native-zlib is loaded & initialized
+ and can be loaded for this job, else false]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Keep trying a limited number of times, waiting a fixed time between attempts,
+ and then fail by re-throwing the exception.
+ ]]>
+
+
+
+
+
+
+
+
+ Keep trying for a maximum time, waiting a fixed time between attempts,
+ and then fail by re-throwing the exception.
+ ]]>
+
+
+
+
+
+
+
+
+ Keep trying a limited number of times, waiting a growing amount of time between attempts,
+ and then fail by re-throwing the exception.
+ The time between attempts is sleepTime mutliplied by the number of tries so far.
+ ]]>
+
+
+
+
+
+
+
+
+ Keep trying a limited number of times, waiting a growing amount of time between attempts,
+ and then fail by re-throwing the exception.
+ The time between attempts is sleepTime mutliplied by a random
+ number in the range of [0, 2 to the number of retries)
+ ]]>
+
+
+
+
+
+
+
+ Set a default policy with some explicit handlers for specific exceptions.
+ ]]>
+
+
+
+
+
+
+
+ A retry policy for RemoteException
+ Set a default policy with some explicit handlers for specific exceptions.
+ ]]>
+
+
+
+
+
+ Try once, and fail by re-throwing the exception.
+ This corresponds to having no retry mechanism in place.
+ ]]>
+
+
+
+
+
+ Try once, and fail silently for void methods, or by
+ re-throwing the exception for non-void methods.
+ ]]>
+
+
+
+
+
+ Keep trying forever.
+ ]]>
+
+
+
+
+ A collection of useful implementations of {@link RetryPolicy}.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+ Determines whether the framework should retry a
+ method for the given exception, and the number
+ of retries that have been made for that operation
+ so far.
+
+ @param e The exception that caused the method to fail.
+ @param retries The number of times the method has been retried.
+ @return true if the method should be retried,
+ false if the method should not be retried
+ but shouldn't fail with an exception (only for void methods).
+ @throws Exception The re-thrown exception e indicating
+ that the method failed and should not be retried further.]]>
+
+
+
+
+ Specifies a policy for retrying method failures.
+ Implementations of this interface should be immutable.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Create a proxy for an interface of an implementation class
+ using the same retry policy for each method in the interface.
+
+ @param iface the interface that the retry will implement
+ @param implementation the instance whose methods should be retried
+ @param retryPolicy the policy for retirying method call failures
+ @return the retry proxy]]>
+
+
+
+
+
+
+
+
+ Create a proxy for an interface of an implementation class
+ using the a set of retry policies specified by method name.
+ If no retry policy is defined for a method then a default of
+ {@link RetryPolicies#TRY_ONCE_THEN_FAIL} is used.
+
+ @param iface the interface that the retry will implement
+ @param implementation the instance whose methods should be retried
+ @param methodNameToPolicyMap a map of method names to retry policies
+ @return the retry proxy]]>
+
+
+
+
+ A factory for creating retry proxies.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+ Prepare the deserializer for reading.]]>
+
+
+
+
+
+
+
+ Deserialize the next object from the underlying input stream.
+ If the object t is non-null then this deserializer
+ may set its internal state to the next object read from the input
+ stream. Otherwise, if the object t is null a new
+ deserialized object will be created.
+
+ @return the deserialized object]]>
+
+
+
+
+
+ Close the underlying input stream and clear up any resources.]]>
+
+
+
+
+ Provides a facility for deserializing objects of type from an
+ {@link InputStream}.
+
+
+
+ Deserializers are stateful, but must not buffer the input since
+ other producers may read from the input between calls to
+ {@link #deserialize(Object)}.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A {@link RawComparator} that uses a {@link Deserializer} to deserialize
+ the objects to be compared so that the standard {@link Comparator} can
+ be used to compare them.
+
+
+ One may optimize compare-intensive operations by using a custom
+ implementation of {@link RawComparator} that operates directly
+ on byte representations.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ An experimental {@link Serialization} for Java {@link Serializable} classes.
+
+ @see JavaSerializationComparator]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A {@link RawComparator} that uses a {@link JavaSerialization}
+ {@link Deserializer} to deserialize objects that are then compared via
+ their {@link Comparable} interfaces.
+
+ @param
+ @see JavaSerialization]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Encapsulates a {@link Serializer}/{@link Deserializer} pair.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+ Serializations are found by reading the io.serializations
+ property from conf, which is a comma-delimited list of
+ classnames.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A factory for {@link Serialization}s.
+ ]]>
+
+
+
+
+
+
+
+
+
+ Prepare the serializer for writing.]]>
+
+
+
+
+
+
+ Serialize t to the underlying output stream.]]>
+
+
+
+
+
+ Close the underlying output stream and clear up any resources.]]>
+
+
+
+
+ Provides a facility for serializing objects of type to an
+ {@link OutputStream}.
+
+
+
+ Serializers are stateful, but must not buffer the output since
+ other producers may write to the output between calls to
+ {@link #serialize(Object)}.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ param, to the IPC server running at
+ address, returning the value. Throws exceptions if there are
+ network problems or if the remote code threw an exception.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Unwraps any IOException.
+
+ @param lookupTypes the desired exception class.
+ @return IOException, which is either the lookupClass exception or this.]]>
+
+
+
+
+ This unwraps any Throwable that has a constructor taking
+ a String as a parameter.
+ Otherwise it returns this.
+
+ @return Throwable]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ protocol is a Java interface. All parameters and return types must
+ be one of:
+
+
a primitive type, boolean, byte,
+ char, short, int, long,
+ float, double, or void; or
+
+
a {@link String}; or
+
+
a {@link Writable}; or
+
+
an array of the above types
+
+ All methods in the protocol should throw only IOException. No field data of
+ the protocol instance is transmitted.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ handlerCount determines
+ the number of handler threads that will be used to process calls.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class has a number of metrics variables that are publicly accessible;
+ these variables (objects) have methods to update their values;
+ for example:
+
{@link #rpcQueueTime}.inc(time)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ For the statistics that are sampled and averaged, one must specify
+ a metrics context that does periodic update calls. Most do.
+ The default Null metrics context however does NOT. So if you aren't
+ using any other metrics context then you can turn on the viewing and averaging
+ of sampled metrics by specifying the following two lines
+ in the hadoop-meterics.properties file:
+
+ Note that the metrics are collected regardless of the context used.
+ The context with the update thread is used to average the data periodically]]>
+
Grouphandles localization of the class name and the
+ counter names.
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FileInputFormat implementations can override this and return
+ false to ensure that individual input files are never split-up
+ so that {@link Mapper}s process entire files.
+
+ @param fs the file system that the file is on
+ @param filename the file name to check
+ @return is this file splitable?]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FileInputFormat is the base class for all file-based
+ InputFormats. This provides a generic implementation of
+ {@link #getSplits(JobConf, int)}.
+ Subclasses of FileInputFormat can also override the
+ {@link #isSplitable(FileSystem, Path)} method to ensure input-files are
+ not split-up and are processed as a whole by {@link Mapper}s.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the job output should be compressed,
+ false otherwise]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Tasks' Side-Effect Files
+
+
Some applications need to create/write-to side-files, which differ from
+ the actual job-outputs.
+
+
In such cases there could be issues with 2 instances of the same TIP
+ (running simultaneously e.g. speculative tasks) trying to open/write-to the
+ same file (path) on HDFS. Hence the application-writer will have to pick
+ unique names per task-attempt (e.g. using the attemptid, say
+ attempt_200709221812_0001_m_000000_0), not just per TIP.
+
+
To get around this the Map-Reduce framework helps the application-writer
+ out by maintaining a special
+ ${mapred.output.dir}/_temporary/_${taskid}
+ sub-directory for each task-attempt on HDFS where the output of the
+ task-attempt goes. On successful completion of the task-attempt the files
+ in the ${mapred.output.dir}/_temporary/_${taskid} (only)
+ are promoted to ${mapred.output.dir}. Of course, the
+ framework discards the sub-directory of unsuccessful task-attempts. This
+ is completely transparent to the application.
+
+
The application-writer can take advantage of this by creating any
+ side-files required in ${mapred.work.output.dir} during execution
+ of his reduce-task i.e. via {@link #getWorkOutputPath(JobConf)}, and the
+ framework will move them out similarly - thus she doesn't have to pick
+ unique paths per task-attempt.
+
+
Note: the value of ${mapred.work.output.dir} during
+ execution of a particular task-attempt is actually
+ ${mapred.output.dir}/_temporary/_{$taskid}, and this value is
+ set by the map-reduce framework. So, just create any side-files in the
+ path returned by {@link #getWorkOutputPath(JobConf)} from map/reduce
+ task to take advantage of this feature.
+
+
The entire discussion holds true for maps of jobs with
+ reducer=NONE (i.e. 0 reduces) since output of the map, in that case,
+ goes directly to HDFS.
+
+ @return the {@link Path} to the task's temporary output directory
+ for the map-reduce job.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This method is used to validate the input directories when a job is
+ submitted so that the {@link JobClient} can fail early, with an useful
+ error message, in case of errors. For e.g. input directory does not exist.
+
+
+ @param job job configuration.
+ @throws InvalidInputException if the job does not have valid input
+ @deprecated getSplits is called in the client and can perform any
+ necessary validation of the input]]>
+
+
+
+
+
+
+
+ Each {@link InputSplit} is then assigned to an individual {@link Mapper}
+ for processing.
+
+
Note: The split is a logical split of the inputs and the
+ input files are not physically split into chunks. For e.g. a split could
+ be <input-file-path, start, offset> tuple.
+
+ @param job job configuration.
+ @param numSplits the desired number of splits, a hint.
+ @return an array of {@link InputSplit}s for the job.]]>
+
+
+
+
+
+
+
+
+ It is the responsibility of the RecordReader to respect
+ record boundaries while processing the logical split to present a
+ record-oriented view to the individual task.
+
+ @param split the {@link InputSplit}
+ @param job the job that this split belongs to
+ @return a {@link RecordReader}]]>
+
+
+
+ InputFormat describes the input-specification for a
+ Map-Reduce job.
+
+
The Map-Reduce framework relies on the InputFormat of the
+ job to:
+
+
+ Validate the input-specification of the job.
+
+ Split-up the input file(s) into logical {@link InputSplit}s, each of
+ which is then assigned to an individual {@link Mapper}.
+
+
+ Provide the {@link RecordReader} implementation to be used to glean
+ input records from the logical InputSplit for processing by
+ the {@link Mapper}.
+
+
+
+
The default behavior of file-based {@link InputFormat}s, typically
+ sub-classes of {@link FileInputFormat}, is to split the
+ input into logical {@link InputSplit}s based on the total size, in
+ bytes, of the input files. However, the {@link FileSystem} blocksize of
+ the input files is treated as an upper bound for input splits. A lower bound
+ on the split size can be set via
+
+ mapred.min.split.size.
+
+
Clearly, logical splits based on input-size is insufficient for many
+ applications since record boundaries are to respected. In such cases, the
+ application has to also implement a {@link RecordReader} on whom lies the
+ responsibilty to respect record-boundaries and present a record-oriented
+ view of the logical InputSplit to the individual task.
+
+ @see InputSplit
+ @see RecordReader
+ @see JobClient
+ @see FileInputFormat]]>
+
+
+
+
+
+
+
+
+
+ InputSplit.
+
+ @return the number of bytes in the input split.
+ @throws IOException]]>
+
+
+
+
+
+ InputSplit is
+ located as an array of Strings.
+ @throws IOException]]>
+
+
+
+ InputSplit represents the data to be processed by an
+ individual {@link Mapper}.
+
+
Typically, it presents a byte-oriented view on the input and is the
+ responsibility of {@link RecordReader} of the job to process this and present
+ a record-oriented view.
+
+ @see InputFormat
+ @see RecordReader]]>
+
+ Checking the input and output specifications of the job.
+
+
+ Computing the {@link InputSplit}s for the job.
+
+
+ Setup the requisite accounting information for the {@link DistributedCache}
+ of the job, if necessary.
+
+
+ Copying the job's jar and configuration to the map-reduce system directory
+ on the distributed file-system.
+
+
+ Submitting the job to the JobTracker and optionally monitoring
+ it's status.
+
+
+
+ Normally the user creates the application, describes various facets of the
+ job via {@link JobConf} and then uses the JobClient to submit
+ the job and monitor its progress.
+
+
Here is an example on how to use JobClient:
+
+ // Create a new JobConf
+ JobConf job = new JobConf(new Configuration(), MyJob.class);
+
+ // Specify various job-specific parameters
+ job.setJobName("myjob");
+
+ job.setInputPath(new Path("in"));
+ job.setOutputPath(new Path("out"));
+
+ job.setMapperClass(MyJob.MyMapper.class);
+ job.setReducerClass(MyJob.MyReducer.class);
+
+ // Submit the job, then poll for progress until the job is complete
+ JobClient.runJob(job);
+
+
+
Job Control
+
+
At times clients would chain map-reduce jobs to accomplish complex tasks
+ which cannot be done via a single map-reduce job. This is fairly easy since
+ the output of the job, typically, goes to distributed file-system and that
+ can be used as the input for the next job.
+
+
However, this also means that the onus on ensuring jobs are complete
+ (success/failure) lies squarely on the clients. In such situations the
+ various job-control options are:
+
+
+ {@link #runJob(JobConf)} : submits the job and returns only after
+ the job has completed.
+
+
+ {@link #submitJob(JobConf)} : only submits the job, then poll the
+ returned handle to the {@link RunningJob} to query status and make
+ scheduling decisions.
+
+
+ {@link JobConf#setJobEndNotificationURI(String)} : setup a notification
+ on job-completion, thus avoiding polling.
+
For key-value pairs (K1,V1) and (K2,V2), the values (V1, V2) are passed
+ in a single call to the reduce function if K1 and K2 compare as equal.
+
+
Since {@link #setOutputKeyComparatorClass(Class)} can be used to control
+ how keys are sorted, this can be used in conjunction to simulate
+ secondary sort on values.
+
+
Note: This is not a guarantee of the reduce sort being
+ stable in any sense. (In any case, with the order of available
+ map-outputs to the reduce being non-deterministic, it wouldn't make
+ that much sense.)
+
+ @param theClass the comparator class to be used for grouping keys.
+ It should implement RawComparator.
+ @see #setOutputKeyComparatorClass(Class)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ combiner class used to combine map-outputs
+ before being sent to the reducers. Typically the combiner is same as the
+ the {@link Reducer} for the job i.e. {@link #getReducerClass()}.
+
+ @return the user-defined combiner class used to combine map-outputs.]]>
+
+
+
+
+
+ combiner class used to combine map-outputs
+ before being sent to the reducers.
+
+
The combiner is a task-level aggregation operation which, in some cases,
+ helps to cut down the amount of data transferred from the {@link Mapper} to
+ the {@link Reducer}, leading to better performance.
+
+
Typically the combiner is same as the Reducer for the
+ job i.e. {@link #setReducerClass(Class)}.
+
+ @param theClass the user-defined combiner class used to combine
+ map-outputs.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ true.
+
+ @return true if speculative execution be used for this job,
+ false otherwise.]]>
+
+
+
+
+
+ true if speculative execution
+ should be turned on, else false.]]>
+
+
+
+
+ true.
+
+ @return true if speculative execution be
+ used for this job for map tasks,
+ false otherwise.]]>
+
+
+
+
+
+ true if speculative execution
+ should be turned on for map tasks,
+ else false.]]>
+
+
+
+
+ true.
+
+ @return true if speculative execution be used
+ for reduce tasks for this job,
+ false otherwise.]]>
+
+
+
+
+
+ true if speculative execution
+ should be turned on for reduce tasks,
+ else false.]]>
+
+
+
+
+ 1.
+
+ @return the number of reduce tasks for this job.]]>
+
+
+
+
+
+ Note: This is only a hint to the framework. The actual
+ number of spawned map tasks depends on the number of {@link InputSplit}s
+ generated by the job's {@link InputFormat#getSplits(JobConf, int)}.
+
+ A custom {@link InputFormat} is typically used to accurately control
+ the number of map tasks for the job.
+
+
How many maps?
+
+
The number of maps is usually driven by the total size of the inputs
+ i.e. total number of blocks of the input files.
+
+
The right level of parallelism for maps seems to be around 10-100 maps
+ per-node, although it has been set up to 300 or so for very cpu-light map
+ tasks. Task setup takes awhile, so it is best if the maps take at least a
+ minute to execute.
+
+
The default behavior of file-based {@link InputFormat}s is to split the
+ input into logical {@link InputSplit}s based on the total size, in
+ bytes, of input files. However, the {@link FileSystem} blocksize of the
+ input files is treated as an upper bound for input splits. A lower bound
+ on the split size can be set via
+
+ mapred.min.split.size.
+
+
Thus, if you expect 10TB of input data and have a blocksize of 128MB,
+ you'll end up with 82,000 maps, unless {@link #setNumMapTasks(int)} is
+ used to set it even higher.
+
+ @param n the number of map tasks for this job.
+ @see InputFormat#getSplits(JobConf, int)
+ @see FileInputFormat
+ @see FileSystem#getDefaultBlockSize()
+ @see FileStatus#getBlockSize()]]>
+
+
+
+
+ 1.
+
+ @return the number of reduce tasks for this job.]]>
+
+
+
+
+
+ How many reduces?
+
+
With 0.95 all of the reduces can launch immediately and
+ start transfering map outputs as the maps finish. With 1.75
+ the faster nodes will finish their first round of reduces and launch a
+ second wave of reduces doing a much better job of load balancing.
+
+
Increasing the number of reduces increases the framework overhead, but
+ increases load balancing and lowers the cost of failures.
+
+
The scaling factors above are slightly less than whole numbers to
+ reserve a few reduce slots in the framework for speculative-tasks, failures
+ etc.
+
+
Reducer NONE
+
+
It is legal to set the number of reduce-tasks to zero.
+
+
In this case the output of the map-tasks directly go to distributed
+ file-system, to the path set by
+ {@link FileOutputFormat#setOutputPath(JobConf, Path)}. Also, the
+ framework doesn't sort the map-outputs before writing it out to HDFS.
+
+ @param n the number of reduce tasks for this job.]]>
+
+
+
+
+ mapred.map.max.attempts
+ property. If this property is not already set, the default is 4 attempts.
+
+ @return the max number of attempts per map task.]]>
+
+
+
+
+
+
+
+
+
+
+ mapred.reduce.max.attempts
+ property. If this property is not already set, the default is 4 attempts.
+
+ @return the max number of attempts per reduce task.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ noFailures, the
+ tasktracker is blacklisted for this job.
+
+ @param noFailures maximum no. of failures of a given job per tasktracker.]]>
+
+
+
+
+ blacklisted for this job.
+
+ @return the maximum no. of failures of a given job per tasktracker.]]>
+
+
+
+
+ failed.
+
+ Defaults to zero, i.e. any failed map-task results in
+ the job being declared as {@link JobStatus#FAILED}.
+
+ @return the maximum percentage of map tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+
+ failed.
+
+ @param percent the maximum percentage of map tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+ failed.
+
+ Defaults to zero, i.e. any failed reduce-task results
+ in the job being declared as {@link JobStatus#FAILED}.
+
+ @return the maximum percentage of reduce tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+
+ failed.
+
+ @param percent the maximum percentage of reduce tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The debug script can aid debugging of failed map tasks. The script is
+ given task's stdout, stderr, syslog, jobconf files as arguments.
+
+
The debug command, run on the node where the map failed, is:
+
+ $script $stdout $stderr $syslog $jobconf.
+
+
+
The script file is distributed through {@link DistributedCache}
+ APIs. The script needs to be symlinked.
+
+ @param mDbgScript the script name]]>
+
+
+
+
+
+
+
+
+
+
+ The debug script can aid debugging of failed reduce tasks. The script
+ is given task's stdout, stderr, syslog, jobconf files as arguments.
+
+
The debug command, run on the node where the map failed, is:
+
+ $script $stdout $stderr $syslog $jobconf.
+
+
+
The script file is distributed through {@link DistributedCache}
+ APIs. The script file needs to be symlinked
+
+ @param rDbgScript the script name]]>
+
+
+
+
+
+
+
+
+
+ null if it hasn't
+ been set.
+ @see #setJobEndNotificationURI(String)]]>
+
+
+
+
+
+ The uri can contain 2 special parameters: $jobId and
+ $jobStatus. Those, if present, are replaced by the job's
+ identifier and completion-status respectively.
+
+
This is typically used by application-writers to implement chaining of
+ Map-Reduce jobs in an asynchronous manner.
+
+ @param uri the job end notification uri
+ @see JobStatus
+ @see Job Completion and Chaining]]>
+
+
+
+
+
+ When a job starts, a shared directory is created at location
+
+ ${mapred.local.dir}/taskTracker/jobcache/$jobid/work/ .
+ This directory is exposed to the users through
+ job.local.dir .
+ So, the tasks can use this space
+ as scratch space and share files among them.
+ This value is available as System property also.
+
+ @return The localized job specific shared directory]]>
+
+
+
+ JobConf is the primary interface for a user to describe a
+ map-reduce job to the Hadoop framework for execution. The framework tries to
+ faithfully execute the job as-is described by JobConf, however:
+
+
+ Some configuration parameters might have been marked as
+
+ final by administrators and hence cannot be altered.
+
+
+ While some job parameters are straight-forward to set
+ (e.g. {@link #setNumReduceTasks(int)}), some parameters interact subtly
+ rest of the framework and/or job-configuration and is relatively more
+ complex for the user to control finely (e.g. {@link #setNumMapTasks(int)}).
+
+
+
+
JobConf typically specifies the {@link Mapper}, combiner
+ (if any), {@link Partitioner}, {@link Reducer}, {@link InputFormat} and
+ {@link OutputFormat} implementations to be used etc.
+
+
Optionally JobConf is used to specify other advanced facets
+ of the job such as Comparators to be used, files to be put in
+ the {@link DistributedCache}, whether or not intermediate and/or job outputs
+ are to be compressed (and how), debugability via user-provided scripts
+ ( {@link #setMapDebugScript(String)}/{@link #setReduceDebugScript(String)}),
+ for doing post-processing on task logs, task's stdout, stderr, syslog.
+ and etc.
+
+
Here is an example on how to configure a job via JobConf:
+
+ // Create a new JobConf
+ JobConf job = new JobConf(new Configuration(), MyJob.class);
+
+ // Specify various job-specific parameters
+ job.setJobName("myjob");
+
+ FileInputFormat.setInputPaths(job, new Path("in"));
+ FileOutputFormat.setOutputPath(job, new Path("out"));
+
+ job.setMapperClass(MyJob.MyMapper.class);
+ job.setCombinerClass(MyJob.MyReducer.class);
+ job.setReducerClass(MyJob.MyReducer.class);
+
+ job.setInputFormat(SequenceFileInputFormat.class);
+ job.setOutputFormat(SequenceFileOutputFormat.class);
+
+ @param jtIdentifier jobTracker identifier, or null
+ @param jobId job number, or null
+ @return a regex pattern matching JobIDs]]>
+
+
+
+
+ An example JobID is :
+ job_200707121733_0003 , which represents the third job
+ running at the jobtracker started at 200707121733.
+
+ Applications should never construct or parse JobID strings, but rather
+ use appropriate constructors or {@link #forName(String)} method.
+
+ @see TaskID
+ @see TaskAttemptID
+ @see JobTracker#getNewJobId()
+ @see JobTracker#getStartTime()]]>
+
Applications can use the {@link Reporter} provided to report progress
+ or just indicate that they are alive. In scenarios where the application
+ takes an insignificant amount of time to process individual key/value
+ pairs, this is crucial since the framework might assume that the task has
+ timed-out and kill that task. The other way of avoiding this is to set
+
+ mapred.task.timeout to a high-enough value (or even zero for no
+ time-outs).
+
+ @param key the input key.
+ @param value the input value.
+ @param output collects mapped keys and values.
+ @param reporter facility to report progress.]]>
+
+
+
+ Maps are the individual tasks which transform input records into a
+ intermediate records. The transformed intermediate records need not be of
+ the same type as the input records. A given input pair may map to zero or
+ many output pairs.
+
+
The Hadoop Map-Reduce framework spawns one map task for each
+ {@link InputSplit} generated by the {@link InputFormat} for the job.
+ Mapper implementations can access the {@link JobConf} for the
+ job via the {@link JobConfigurable#configure(JobConf)} and initialize
+ themselves. Similarly they can use the {@link Closeable#close()} method for
+ de-initialization.
+
+
The framework then calls
+ {@link #map(Object, Object, OutputCollector, Reporter)}
+ for each key/value pair in the InputSplit for that task.
+
+
All intermediate values associated with a given output key are
+ subsequently grouped by the framework, and passed to a {@link Reducer} to
+ determine the final output. Users can control the grouping by specifying
+ a Comparator via
+ {@link JobConf#setOutputKeyComparatorClass(Class)}.
+
+
The grouped Mapper outputs are partitioned per
+ Reducer. Users can control which keys (and hence records) go to
+ which Reducer by implementing a custom {@link Partitioner}.
+
+
Users can optionally specify a combiner, via
+ {@link JobConf#setCombinerClass(Class)}, to perform local aggregation of the
+ intermediate outputs, which helps to cut down the amount of data transferred
+ from the Mapper to the Reducer.
+
+
The intermediate, grouped outputs are always stored in
+ {@link SequenceFile}s. Applications can specify if and how the intermediate
+ outputs are to be compressed and which {@link CompressionCodec}s are to be
+ used via the JobConf.
+
+
If the job has
+ zero
+ reduces then the output of the Mapper is directly written
+ to the {@link FileSystem} without grouping by keys.
+
+
Example:
+
+ public class MyMapper<K extends WritableComparable, V extends Writable>
+ extends MapReduceBase implements Mapper<K, V, K, V> {
+
+ static enum MyCounters { NUM_RECORDS }
+
+ private String mapTaskId;
+ private String inputFile;
+ private int noRecords = 0;
+
+ public void configure(JobConf job) {
+ mapTaskId = job.get("mapred.task.id");
+ inputFile = job.get("mapred.input.file");
+ }
+
+ public void map(K key, V val,
+ OutputCollector<K, V> output, Reporter reporter)
+ throws IOException {
+ // Process the <key, value> pair (assume this takes a while)
+ // ...
+ // ...
+
+ // Let the framework know that we are alive, and kicking!
+ // reporter.progress();
+
+ // Process some more
+ // ...
+ // ...
+
+ // Increment the no. of <key, value> pairs processed
+ ++noRecords;
+
+ // Increment counters
+ reporter.incrCounter(NUM_RECORDS, 1);
+
+ // Every 100 records update application-level status
+ if ((noRecords%100) == 0) {
+ reporter.setStatus(mapTaskId + " processed " + noRecords +
+ " from input-file: " + inputFile);
+ }
+
+ // Output the result
+ output.collect(key, val);
+ }
+ }
+
+
+
Applications may write a custom {@link MapRunnable} to exert greater
+ control on map processing e.g. multi-threaded Mappers etc.
Mapping of input records to output records is complete when this method
+ returns.
+
+ @param input the {@link RecordReader} to read the input records.
+ @param output the {@link OutputCollector} to collect the outputrecords.
+ @param reporter {@link Reporter} to report progress, status-updates etc.
+ @throws IOException]]>
+
+
+
+ Custom implementations of MapRunnable can exert greater
+ control on map processing e.g. multi-threaded, asynchronous mappers etc.
+
+ @see Mapper]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ nearly
+ equal content length.
+ Subclasses implement {@link #getRecordReader(InputSplit, JobConf, Reporter)}
+ to construct RecordReader's for MultiFileSplit's.
+ @see MultiFileSplit]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ th Path]]>
+
+
+
+
+
+
+
+
+
+
+ th Path]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ MultiFileSplit can be used to implement {@link RecordReader}'s, with
+ reading one record per file.
+ @see FileSplit
+ @see MultiFileInputFormat]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ <key, value> pairs output by {@link Mapper}s
+ and {@link Reducer}s.
+
+
OutputCollector is the generalization of the facility
+ provided by the Map-Reduce framework to collect data output by either the
+ Mapper or the Reducer i.e. intermediate outputs
+ or the output of the job.
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This is to validate the output specification for the job when it is
+ a job is submitted. Typically checks that it does not already exist,
+ throwing an exception when it already exists, so that output is not
+ overwritten.
+
+ @param ignored
+ @param job job configuration.
+ @throws IOException when output should not be attempted]]>
+
+
+
+ OutputFormat describes the output-specification for a
+ Map-Reduce job.
+
+
The Map-Reduce framework relies on the OutputFormat of the
+ job to:
+
+
+ Validate the output-specification of the job. For e.g. check that the
+ output directory doesn't already exist.
+
+ Provide the {@link RecordWriter} implementation to be used to write out
+ the output files of the job. Output files are stored in a
+ {@link FileSystem}.
+
+
+
+ @see RecordWriter
+ @see JobConf]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the job output should be compressed,
+ false otherwise]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Typically a hash function on a all or a subset of the key.
+
+ @param key the key to be paritioned.
+ @param value the entry value.
+ @param numPartitions the total number of partitions.
+ @return the partition number for the key.]]>
+
+
+
+ Partitioner controls the partitioning of the keys of the
+ intermediate map-outputs. The key (or a subset of the key) is used to derive
+ the partition, typically by a hash function. The total number of partitions
+ is the same as the number of reduce tasks for the job. Hence this controls
+ which of the m reduce tasks the intermediate key (and hence the
+ record) is sent for reduction.
+
+ @see Reducer]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ 0.0 to 1.0.
+ @throws IOException]]>
+
+
+
+ RecordReader reads <key, value> pairs from an
+ {@link InputSplit}.
+
+
RecordReader, typically, converts the byte-oriented view of
+ the input, provided by the InputSplit, and presents a
+ record-oriented view for the {@link Mapper} & {@link Reducer} tasks for
+ processing. It thus assumes the responsibility of processing record
+ boundaries and presenting the tasks with keys and values.
RecordWriter implementations write the job outputs to the
+ {@link FileSystem}.
+
+ @see OutputFormat]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Reduces values for a given key.
+
+
The framework calls this method for each
+ <key, (list of values)> pair in the grouped inputs.
+ Output values must be of the same type as input values. Input keys must
+ not be altered. The framework will reuse the key and value objects
+ that are passed into the reduce, therefore the application should clone
+ the objects they want to keep a copy of. In many cases, all values are
+ combined into zero or one value.
+
+
+
Output pairs are collected with calls to
+ {@link OutputCollector#collect(Object,Object)}.
+
+
Applications can use the {@link Reporter} provided to report progress
+ or just indicate that they are alive. In scenarios where the application
+ takes an insignificant amount of time to process individual key/value
+ pairs, this is crucial since the framework might assume that the task has
+ timed-out and kill that task. The other way of avoiding this is to set
+
+ mapred.task.timeout to a high-enough value (or even zero for no
+ time-outs).
+
+ @param key the key.
+ @param values the list of values to reduce.
+ @param output to collect keys and combined values.
+ @param reporter facility to report progress.]]>
+
+
+
+ The number of Reducers for the job is set by the user via
+ {@link JobConf#setNumReduceTasks(int)}. Reducer implementations
+ can access the {@link JobConf} for the job via the
+ {@link JobConfigurable#configure(JobConf)} method and initialize themselves.
+ Similarly they can use the {@link Closeable#close()} method for
+ de-initialization.
+
+
Reducer has 3 primary phases:
+
+
+
+
Shuffle
+
+
Reducer is input the grouped output of a {@link Mapper}.
+ In the phase the framework, for each Reducer, fetches the
+ relevant partition of the output of all the Mappers, via HTTP.
+
+
+
+
+
Sort
+
+
The framework groups Reducer inputs by keys
+ (since different Mappers may have output the same key) in this
+ stage.
+
+
The shuffle and sort phases occur simultaneously i.e. while outputs are
+ being fetched they are merged.
+
+
SecondarySort
+
+
If equivalence rules for keys while grouping the intermediates are
+ different from those for grouping keys before reduction, then one may
+ specify a Comparator via
+ {@link JobConf#setOutputValueGroupingComparator(Class)}.Since
+ {@link JobConf#setOutputKeyComparatorClass(Class)} can be used to
+ control how intermediate keys are grouped, these can be used in conjunction
+ to simulate secondary sort on values.
+
+
+ For example, say that you want to find duplicate web pages and tag them
+ all with the url of the "best" known example. You would set up the job
+ like:
+
+
Map Input Key: url
+
Map Input Value: document
+
Map Output Key: document checksum, url pagerank
+
Map Output Value: url
+
Partitioner: by checksum
+
OutputKeyComparator: by checksum and then decreasing pagerank
+
OutputValueGroupingComparator: by checksum
+
+
+
+
+
Reduce
+
+
In this phase the
+ {@link #reduce(Object, Iterator, OutputCollector, Reporter)}
+ method is called for each <key, (list of values)> pair in
+ the grouped inputs.
+
The output of the reduce task is typically written to the
+ {@link FileSystem} via
+ {@link OutputCollector#collect(Object, Object)}.
+
+
+
+
The output of the Reducer is not re-sorted.
+
+
Example:
+
+ public class MyReducer<K extends WritableComparable, V extends Writable>
+ extends MapReduceBase implements Reducer<K, V, K, V> {
+
+ static enum MyCounters { NUM_RECORDS }
+
+ private String reduceTaskId;
+ private int noKeys = 0;
+
+ public void configure(JobConf job) {
+ reduceTaskId = job.get("mapred.task.id");
+ }
+
+ public void reduce(K key, Iterator<V> values,
+ OutputCollector<K, V> output,
+ Reporter reporter)
+ throws IOException {
+
+ // Process
+ int noValues = 0;
+ while (values.hasNext()) {
+ V value = values.next();
+
+ // Increment the no. of values for this key
+ ++noValues;
+
+ // Process the <key, value> pair (assume this takes a while)
+ // ...
+ // ...
+
+ // Let the framework know that we are alive, and kicking!
+ if ((noValues%10) == 0) {
+ reporter.progress();
+ }
+
+ // Process some more
+ // ...
+ // ...
+
+ // Output the <key, value>
+ output.collect(key, value);
+ }
+
+ // Increment the no. of <key, list of values> pairs processed
+ ++noKeys;
+
+ // Increment counters
+ reporter.incrCounter(NUM_RECORDS, 1);
+
+ // Every 100 keys update application-level status
+ if ((noKeys%100) == 0) {
+ reporter.setStatus(reduceTaskId + " processed " + noKeys);
+ }
+ }
+ }
+
+
+ @see Mapper
+ @see Partitioner
+ @see Reporter
+ @see MapReduceBase]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Enum.
+ @param amount A non-negative amount by which the counter is to
+ be incremented.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ InputSplit that the map is reading from.
+ @throws UnsupportedOperationException if called outside a mapper]]>
+
+
+
+
+
+
+
+
+ {@link Mapper} and {@link Reducer} can use the Reporter
+ provided to report progress or just indicate that they are alive. In
+ scenarios where the application takes an insignificant amount of time to
+ process individual key/value pairs, this is crucial since the framework
+ might assume that the task has timed-out and kill that task.
+
+
Applications can also update {@link Counters} via the provided
+ Reporter .
+
+ @see Progressable
+ @see Counters]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ progress of the job's map-tasks, as a float between 0.0
+ and 1.0. When all map tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's map-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ progress of the job's reduce-tasks, as a float between 0.0
+ and 1.0. When all reduce tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's reduce-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ true if the job is complete, else false.
+ @throws IOException]]>
+
+
+
+
+
+ true if the job succeeded, else false.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ RunningJob is the user-interface to query for details on a
+ running Map-Reduce job.
+
+
Clients can get hold of RunningJob via the {@link JobClient}
+ and then query the running-job for details such as name, configuration,
+ progress etc.
+
+ @see JobClient]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This allows the user to specify the key class to be different
+ from the actual class ({@link BytesWritable}) used for writing
+
+ @param conf the {@link JobConf} to modify
+ @param theClass the SequenceFile output key class.]]>
+
+
+
+
+
+
+ This allows the user to specify the value class to be different
+ from the actual class ({@link BytesWritable}) used for writing
+
+ @param conf the {@link JobConf} to modify
+ @param theClass the SequenceFile output key class.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ f. The filtering criteria is
+ MD5(key) % f == 0.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ f using
+ the criteria record# % f == 0.
+ For example, if the frequency is 10, one out of 10 records is returned.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ .
+ @param name The name of the server
+ @param port The port to use on the server
+ @param findPort whether the server should start at the given port and
+ increment by 1 until it finds a free port.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ points to the log directory
+ "/static/" -> points to common static files (src/webapps/static)
+ "/" -> the jsp server code from (src/webapps/)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ all task attempt IDs
+ of any jobtracker, in any job, of the first
+ map task, we would use :
+
+ @param jtIdentifier jobTracker identifier, or null
+ @param jobId job number, or null
+ @param isMap whether the tip is a map, or null
+ @param taskId taskId number, or null
+ @param attemptId the task attempt number, or null
+ @return a regex pattern matching TaskAttemptIDs]]>
+
+
+
+
+ An example TaskAttemptID is :
+ attempt_200707121733_0003_m_000005_0 , which represents the
+ zeroth task attempt for the fifth map task in the third job
+ running at the jobtracker started at 200707121733.
+
+ Applications should never construct or parse TaskAttemptID strings
+ , but rather use appropriate constructors or {@link #forName(String)}
+ method.
+
+ @see JobID
+ @see TaskID]]>
+
+ @param jtIdentifier jobTracker identifier, or null
+ @param jobId job number, or null
+ @param isMap whether the tip is a map, or null
+ @param taskId taskId number, or null
+ @return a regex pattern matching TaskIDs]]>
+
+
+
+
+ An example TaskID is :
+ task_200707121733_0003_m_000005 , which represents the
+ fifth map task in the third job running at the jobtracker
+ started at 200707121733.
+
+ Applications should never construct or parse TaskID strings
+ , but rather use appropriate constructors or {@link #forName(String)}
+ method.
+
+ @see JobID
+ @see TaskAttemptID]]>
+
+ Map implementations using this MapRunnable must be thread-safe.
+
+ The Map-Reduce job has to be configured to use this MapRunnable class (using
+ the JobConf.setMapRunnerClass method) and
+ the number of thread the thread-pool can use with the
+ mapred.map.multithreadedrunner.threads property, its default
+ value is 10 threads.
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ pairs. Uses
+ {@link StringTokenizer} to break text into tokens.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ generateKeyValPairs(Object key, Object value); public void
+ configure(JobConfjob); }
+
+ The package also provides a base class, ValueAggregatorBaseDescriptor,
+ implementing the above interface. The user can extend the base class and
+ implement generateKeyValPairs accordingly.
+
+ The primary work of generateKeyValPairs is to emit one or more key/value
+ pairs based on the input key/value pair. The key in an output key/value pair
+ encode two pieces of information: aggregation type and aggregation id. The
+ value will be aggregated onto the aggregation id according the aggregation
+ type.
+
+ This class offers a function to generate a map/reduce job using Aggregate
+ framework. The function takes the following parameters: input directory spec
+ input format (text or sequence file) output directory a file specifying the
+ user plugin class]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ When constructing the instance, if the factory property
+ contextName.class exists,
+ its value is taken to be the name of the class to instantiate. Otherwise,
+ the default is to create an instance of
+ org.apache.hadoop.metrics.spi.NullContext, which is a
+ dummy "no-op" context which will cause all metric data to be discarded.
+
+ @param contextName the name of the context
+ @return the named MetricsContext]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ When the instance is constructed, this method checks if the file
+ hadoop-metrics.properties exists on the class path. If it
+ exists, it must be in the format defined by java.util.Properties, and all
+ the properties in the file are set as attributes on the newly created
+ ContextFactory instance.
+
+ @return the singleton ContextFactory instance]]>
+
+
+
+ getFactory() method.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ startMonitoring() again after calling
+ this.
+ @see #close()]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ recordName.
+ Throws an exception if the metrics implementation is configured with a fixed
+ set of record names and recordName is not in that set.
+
+ @param recordName the name of the record
+ @throws MetricsException if recordName conflicts with configuration data]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A record name identifies the kind of data to be reported. For example, a
+ program reporting statistics relating to the disks on a computer might use
+ a record name "diskStats".
+
+ A record has zero or more tags. A tag has a name and a value. To
+ continue the example, the "diskStats" record might use a tag named
+ "diskName" to identify a particular disk. Sometimes it is useful to have
+ more than one tag, so there might also be a "diskType" with value "ide" or
+ "scsi" or whatever.
+
+ A record also has zero or more metrics. These are the named
+ values that are to be reported to the metrics system. In the "diskStats"
+ example, possible metric names would be "diskPercentFull", "diskPercentBusy",
+ "kbReadPerSecond", etc.
+
+ The general procedure for using a MetricsRecord is to fill in its tag and
+ metric values, and then call update() to pass the record to the
+ client library.
+ Metric data is not immediately sent to the metrics system
+ each time that update() is called.
+ An internal table is maintained, identified by the record name. This
+ table has columns
+ corresponding to the tag and the metric names, and rows
+ corresponding to each unique set of tag values. An update
+ either modifies an existing row in the table, or adds a new row with a set of
+ tag values that are different from all the other rows. Note that if there
+ are no tags, then there can be at most one row in the table.
+
+ Once a row is added to the table, its data will be sent to the metrics system
+ on every timer period, whether or not it has been updated since the previous
+ timer period. If this is inappropriate, for example if metrics were being
+ reported by some transient object in an application, the remove()
+ method can be used to remove the row and thus stop the data from being
+ sent.
+
+ Note that the update() method is atomic. This means that it is
+ safe for different threads to be updating the same metric. More precisely,
+ it is OK for different threads to call update() on MetricsRecord instances
+ with the same set of tag names and tag values. Different threads should
+ not use the same MetricsRecord instance at the same time.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ MetricsContext.registerUpdater().]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ fileName attribute,
+ if specified. Otherwise the data will be written to standard
+ output.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class is configured by setting ContextFactory attributes which in turn
+ are usually configured through a properties file. All the attributes are
+ prefixed by the contextName. For example, the properties file might contain:
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ contextName.tableName. The returned map consists of
+ those attributes with the contextName and tableName stripped off.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ recordName.
+ Throws an exception if the metrics implementation is configured with a fixed
+ set of record names and recordName is not in that set.
+
+ @param recordName the name of the record
+ @throws MetricsException if recordName conflicts with configuration data]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class implements the internal table of metric data, and the timer
+ on which data is to be sent to the metrics system. Subclasses must
+ override the abstract emitRecord method in order to transmit
+ the data. ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ update
+ and remove().]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ hostname or hostname:port. If
+ the specs string is null, defaults to localhost:defaultPort.
+
+ @return a list of InetSocketAddress objects.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ,name="
+ Where the and are the supplied parameters
+
+ @param serviceName
+ @param nameName
+ @param theMbean - the MBean to register
+ @return the named used to register the MBean]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ hadoop.rpc.socket.factory.class.<ClassName>. When no
+ such parameter exists then fall back on the default socket factory as
+ configured by hadoop.rpc.socket.factory.class.default. If
+ this default socket factory is not configured, then fall back on the JVM
+ default socket factory.
+
+ @param conf the configuration
+ @param clazz the class (usually a {@link VersionedProtocol})
+ @return a socket factory]]>
+
+
+
+
+
+ hadoop.rpc.socket.factory.default
+
+ @param conf the configuration
+ @return the default socket factory as specified in the configuration or
+ the JVM default socket factory if the configuration does not
+ contain a default socket factory property.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ :
+ ://:/]]>
+
+
+
+
+
+
+
+ :
+ ://:/]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ From documentation for {@link #getInputStream(Socket, long)}:
+ Returns InputStream for the socket. If the socket has an associated
+ SocketChannel then it returns a
+ {@link SocketInputStream} with the given timeout. If the socket does not
+ have a channel, {@link Socket#getInputStream()} is returned. In the later
+ case, the timeout argument is ignored and the timeout set with
+ {@link Socket#setSoTimeout(int)} applies for reads.
+
+ Any socket created using socket factories returned by {@link #NetUtils},
+ must use this interface instead of {@link Socket#getInputStream()}.
+
+ @see #getInputStream(Socket, long)
+
+ @param socket
+ @return InputStream for reading from the socket.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+ Any socket created using socket factories returned by {@link #NetUtils},
+ must use this interface instead of {@link Socket#getInputStream()}.
+
+ @see Socket#getChannel()
+
+ @param socket
+ @param timeout timeout in milliseconds. This may not always apply. zero
+ for waiting as long as necessary.
+ @return InputStream for reading from the socket.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+ From documentation for {@link #getOutputStream(Socket, long)} :
+ Returns OutputStream for the socket. If the socket has an associated
+ SocketChannel then it returns a
+ {@link SocketOutputStream} with the given timeout. If the socket does not
+ have a channel, {@link Socket#getOutputStream()} is returned. In the later
+ case, the timeout argument is ignored and the write will wait until
+ data is available.
The task requires the file or the nested fileset element to be
+ specified. Optional attributes are language (set the output
+ language, default is "java"),
+ destdir (name of the destination directory for generated java/c++
+ code, default is ".") and failonerror (specifies error handling
+ behavior. default is true).
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ GenericOptionsParser to parse only the generic Hadoop
+ arguments.
+
+ The array of string arguments other than the generic arguments can be
+ obtained by {@link #getRemainingArgs()}.
+
+ @param conf the Configuration to modify.
+ @param args command-line arguments.]]>
+
+
+
+
+ GenericOptionsParser to parse given options as well
+ as generic Hadoop options.
+
+ The resulting CommandLine object can be obtained by
+ {@link #getCommandLine()}.
+
+ @param conf the configuration to modify
+ @param options options built by the caller
+ @param args User-specified arguments]]>
+
+
+
+
+ Strings containing the un-parsed arguments
+ or empty array if commandLine was not defined.]]>
+
+
+
+
+ CommandLine object
+ to process the parsed arguments.
+
+ Note: If the object is created with
+ {@link #GenericOptionsParser(Configuration, String[])}, then returned
+ object will only contain parsed generic options.
+
+ @return CommandLine representing list of arguments
+ parsed against Options descriptor.]]>
+
+
+
+
+
+
+
+
+
+ GenericOptionsParser is a utility to parse command line
+ arguments generic to the Hadoop framework.
+
+ GenericOptionsParser recognizes several standarad command
+ line arguments, enabling applications to easily specify a namenode, a
+ jobtracker, additional configuration resources etc.
+
+
Generic Options
+
+
The supported generic options are:
+
+ -conf <configuration file> specify a configuration file
+ -D <property=value> use value for given property
+ -fs <local|namenode:port> specify a namenode
+ -jt <local|jobtracker:port> specify a job tracker
+ -files <comma separated list of files> specify comma separated
+ files to be copied to the map reduce cluster
+ -libjars <comma separated list of jars> specify comma separated
+ jar files to include in the classpath.
+ -archives <comma separated list of archives> specify comma
+ separated archives to be unarchived on the compute machines.
+
+
Generic command line arguments might modify
+ Configuration objects, given to constructors.
+
+
The functionality is implemented using Commons CLI.
+
+
Examples:
+
+ $ bin/hadoop dfs -fs darwin:8020 -ls /data
+ list /data directory in dfs with namenode darwin:8020
+
+ $ bin/hadoop dfs -D fs.default.name=darwin:8020 -ls /data
+ list /data directory in dfs with namenode darwin:8020
+
+ $ bin/hadoop dfs -conf hadoop-site.xml -ls /data
+ list /data directory in dfs with conf specified in hadoop-site.xml
+
+ $ bin/hadoop job -D mapred.job.tracker=darwin:50020 -submit job.xml
+ submit a job to job tracker darwin:50020
+
+ $ bin/hadoop job -jt darwin:50020 -submit job.xml
+ submit a job to job tracker darwin:50020
+
+ $ bin/hadoop job -jt local -submit job.xml
+ submit a job to local runner
+
+ $ bin/hadoop jar -libjars testlib.jar
+ -archives test.tgz -files file.txt inputjar args
+ job submission with libjars, files and archives
+
+
+ @see Tool
+ @see ToolRunner]]>
+
+
+
+
+
+
+
+
+
+
+ Class<T>) of the
+ argument of type T.
+ @param The type of the argument
+ @param t the object to get it class
+ @return Class<T>]]>
+
+
+
+
+
+
+ List<T> to a an array of
+ T[].
+ @param c the Class object of the items in the list
+ @param list the list to convert]]>
+
+
+
+
+
+ List<T> to a an array of
+ T[].
+ @param list the list to convert
+ @throws ArrayIndexOutOfBoundsException if the list is empty.
+ Use {@link #toArray(Class, List)} if the list may be empty.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if native-hadoop is loaded,
+ else false]]>
+
+
+
+
+
+ true if native hadoop libraries, if present, can be
+ used for this job; false otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ { pq.top().change(); pq.adjustTop(); }
+ instead of
+ { o = pq.pop(); o.change(); pq.push(o); }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Clients and/or applications can use the provided Progressable
+ to explicitly report progress to the Hadoop framework. This is especially
+ important for operations which take an insignificant amount of time since,
+ in-lieu of the reported progress, the framework has to assume that an error
+ has occured and time-out the operation.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Class is to be obtained
+ @return the correctly typed Class of the given object.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Hadoop Pipes
+ or Hadoop Streaming.
+
+ It also checks to ensure that we are running on a *nix platform else
+ (e.g. in Cygwin/Windows) it returns null.
+ @param job job configuration
+ @return a String[] with the ulimit command arguments or
+ null if we are running on a non *nix platform or
+ if the limit is unspecified.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Shell interface.
+ @param cmd shell command to execute.
+ @return the output of the executed command.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Shell can be used to run unix commands like du or
+ df. It also offers facilities to gate commands by
+ time-intervals.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ShellCommandExecutorshould be used in cases where the output
+ of the command needs no explicit parsing and where the command, working
+ directory and the environment remains unchanged. The output of the command
+ is stored as-is and is expected to be small.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ArrayList of string values]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ charToEscape in the string
+ with the escape char escapeChar
+
+ @param str string
+ @param escapeChar escape char
+ @param charToEscape the char to be escaped
+ @return an escaped string]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ charToEscape in the string
+ with the escape char escapeChar
+
+ @param str string
+ @param escapeChar escape char
+ @param charToEscape the escaped char
+ @return an unescaped string]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Tool, is the standard for any Map-Reduce tool/application.
+ The tool/application should delegate the handling of
+
+ standard command-line options to {@link ToolRunner#run(Tool, String[])}
+ and only handle its custom arguments.
+
+
Here is how a typical Tool is implemented:
+
+ public class MyApp extends Configured implements Tool {
+
+ public int run(String[] args) throws Exception {
+ // Configuration processed by ToolRunner
+ Configuration conf = getConf();
+
+ // Create a JobConf using the processed conf
+ JobConf job = new JobConf(conf, MyApp.class);
+
+ // Process custom command-line options
+ Path in = new Path(args[1]);
+ Path out = new Path(args[2]);
+
+ // Specify various job-specific parameters
+ job.setJobName("my-app");
+ job.setInputPath(in);
+ job.setOutputPath(out);
+ job.setMapperClass(MyApp.MyMapper.class);
+ job.setReducerClass(MyApp.MyReducer.class);
+
+ // Submit the job, then poll for progress until the job is complete
+ JobClient.runJob(job);
+ }
+
+ public static void main(String[] args) throws Exception {
+ // Let ToolRunner handle generic command-line options
+ int res = ToolRunner.run(new Configuration(), new Sort(), args);
+
+ System.exit(res);
+ }
+ }
+
+
+ @see GenericOptionsParser
+ @see ToolRunner]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Tool by {@link Tool#run(String[])}, after
+ parsing with the given generic arguments. Uses the given
+ Configuration, or builds one if null.
+
+ Sets the Tool's configuration with the possibly modified
+ version of the conf.
+
+ @param conf Configuration for the Tool.
+ @param tool Tool to run.
+ @param args command-line arguments to the tool.
+ @return exit code of the {@link Tool#run(String[])} method.]]>
+
+
+
+
+
+
+
+ Tool with its Configuration.
+
+ Equivalent to run(tool.getConf(), tool, args).
+
+ @param tool Tool to run.
+ @param args command-line arguments to the tool.
+ @return exit code of the {@link Tool#run(String[])} method.]]>
+
+
+
+
+
+
+
+
+
+ ToolRunner can be used to run classes implementing
+ Tool interface. It works in conjunction with
+ {@link GenericOptionsParser} to parse the
+
+ generic hadoop command line arguments and modifies the
+ Configuration of the Tool. The
+ application-specific options are passed along without being modified.
+
+
+ @see Tool
+ @see GenericOptionsParser]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/x86_64/share/hadoop/common/jdiff/hadoop_0.18.2.xml b/x86_64/share/hadoop/common/jdiff/hadoop_0.18.2.xml
new file mode 100644
index 0000000..08173ab
--- /dev/null
+++ b/x86_64/share/hadoop/common/jdiff/hadoop_0.18.2.xml
@@ -0,0 +1,38788 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ final.
+
+ @param name resource to be added, the classpath is examined for a file
+ with that name.]]>
+
+
+
+
+
+ final.
+
+ @param url url of the resource to be added, the local filesystem is
+ examined directly to find the resource, without referring to
+ the classpath.]]>
+
+
+
+
+
+ final.
+
+ @param file file-path of resource to be added, the local filesystem is
+ examined directly to find the resource, without referring to
+ the classpath.]]>
+
+
+
+
+
+ name property, null if
+ no such property exists.
+
+ Values are processed for variable expansion
+ before being returned.
+
+ @param name the property name.
+ @return the value of the name property,
+ or null if no such property exists.]]>
+
+
+
+
+
+ name property, without doing
+ variable expansion.
+
+ @param name the property name.
+ @return the value of the name property,
+ or null if no such property exists.]]>
+
+
+
+
+
+
+ value of the name property.
+
+ @param name property name.
+ @param value property value.]]>
+
+
+
+
+
+
+ name property. If no such property
+ exists, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value, or defaultValue if the property
+ doesn't exist.]]>
+
+
+
+
+
+
+ name property as an int.
+
+ If no such property exists, or if the specified value is not a valid
+ int, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as an int,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to an int.
+
+ @param name property name.
+ @param value int value of the property.]]>
+
+
+
+
+
+
+ name property as a long.
+ If no such property is specified, or if the specified value is not a valid
+ long, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as a long,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to a long.
+
+ @param name property name.
+ @param value long value of the property.]]>
+
+
+
+
+
+
+ name property as a float.
+ If no such property is specified, or if the specified value is not a valid
+ float, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as a float,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property as a boolean.
+ If no such property is specified, or if the specified value is not a valid
+ boolean, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as a boolean,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to a boolean.
+
+ @param name property name.
+ @param value boolean value of the property.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ name property as
+ a collection of Strings.
+ If no such property is specified then empty collection is returned.
+
+ This is an optimized version of {@link #getStrings(String)}
+
+ @param name property name.
+ @return property value as a collection of Strings.]]>
+
+
+
+
+
+ name property as
+ an array of Strings.
+ If no such property is specified then null is returned.
+
+ @param name property name.
+ @return property value as an array of Strings,
+ or null.]]>
+
+
+
+
+
+
+ name property as
+ an array of Strings.
+ If no such property is specified then default value is returned.
+
+ @param name property name.
+ @param defaultValue The default value
+ @return property value as an array of Strings,
+ or default value.]]>
+
+
+
+
+
+
+ name property as
+ as comma delimited values.
+
+ @param name property name.
+ @param values The values]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ name property as a Class.
+ If no such property is specified, then defaultValue is
+ returned.
+
+ @param name the class name.
+ @param defaultValue default value.
+ @return property value as a Class,
+ or defaultValue.]]>
+
+
+
+
+
+
+
+ name property as a Class
+ implementing the interface specified by xface.
+
+ If no such property is specified, then defaultValue is
+ returned.
+
+ An exception is thrown if the returned class does not implement the named
+ interface.
+
+ @param name the class name.
+ @param defaultValue default value.
+ @param xface the interface implemented by the named class.
+ @return property value as a Class,
+ or defaultValue.]]>
+
+
+
+
+
+
+
+ name property to the name of a
+ theClass implementing the given interface xface.
+
+ An exception is thrown if theClass does not implement the
+ interface xface.
+
+ @param name property name.
+ @param theClass property value.
+ @param xface the interface implemented by the named class.]]>
+
+
+
+
+
+
+
+ dirsProp with
+ the given path. If dirsProp contains multiple directories,
+ then one is chosen based on path's hash code. If the selected
+ directory does not exist, an attempt is made to create it.
+
+ @param dirsProp directory in which to locate the file.
+ @param path file-path.
+ @return local file under the directory with the given path.]]>
+
+
+
+
+
+
+
+ dirsProp with
+ the given path. If dirsProp contains multiple directories,
+ then one is chosen based on path's hash code. If the selected
+ directory does not exist, an attempt is made to create it.
+
+ @param dirsProp directory in which to locate the file.
+ @param path file-path.
+ @return local file under the directory with the given path.]]>
+
+
+
+
+
+
+
+
+
+
+
+ name.
+
+ @param name configuration resource name.
+ @return an input stream attached to the resource.]]>
+
+
+
+
+
+ name.
+
+ @param name configuration resource name.
+ @return a reader attached to the resource.]]>
+
+
+
+
+ String
+ key-value pairs in the configuration.
+
+ @return an iterator over the entries.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true to set quiet-mode on, false
+ to turn it off.]]>
+
+
+
+
+
+
+
+
+
+
+ Resources
+
+
Configurations are specified by resources. A resource contains a set of
+ name/value pairs as XML data. Each resource is named by either a
+ String or by a {@link Path}. If named by a String,
+ then the classpath is examined for a file with that name. If named by a
+ Path, then the local filesystem is examined directly, without
+ referring to the classpath.
+
+
Hadoop by default specifies two resources, loaded in-order from the
+ classpath:
hadoop-site.xml: Site-specific configuration for a given hadoop
+ installation.
+
+ Applications may add additional resources, which are loaded
+ subsequent to these resources in the order they are added.
+
+
Final Parameters
+
+
Configuration parameters may be declared final.
+ Once a resource declares a value final, no subsequently-loaded
+ resource can alter that value.
+ For example, one might define a final parameter with:
+
+
+ When conf.get("tempdir") is called, then ${basedir}
+ will be resolved to another property in this Configuration, while
+ ${user.name} would then ordinarily be resolved to the value
+ of the System property with that name.]]>
+
Applications specify the files, via urls (hdfs:// or http://) to be cached
+ via the {@link JobConf}. The DistributedCache assumes that the
+ files specified via hdfs:// urls are already present on the
+ {@link FileSystem} at the path specified by the url.
+
+
The framework will copy the necessary files on to the slave node before
+ any tasks for the job are executed on that node. Its efficiency stems from
+ the fact that the files are only copied once per job and the ability to
+ cache archives which are un-archived on the slaves.
+
+
DistributedCache can be used to distribute simple, read-only
+ data/text files and/or more complex types such as archives, jars etc.
+ Archives (zip, tar and tgz/tar.gz files) are un-archived at the slave nodes.
+ Jars may be optionally added to the classpath of the tasks, a rudimentary
+ software distribution mechanism. Files have execution permissions.
+ Optionally users can also direct it to symlink the distributed cache file(s)
+ into the working directory of the task.
+
+
DistributedCache tracks modification timestamps of the cache
+ files. Clearly the cache files should not be modified by the application
+ or externally while the job is executing.
+
+
Here is an illustrative example on how to use the
+ DistributedCache:
+
+ // Setting up the cache for the application
+
+ 1. Copy the requisite files to the FileSystem:
+
+ $ bin/hadoop fs -copyFromLocal lookup.dat /myapp/lookup.dat
+ $ bin/hadoop fs -copyFromLocal map.zip /myapp/map.zip
+ $ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar
+ $ bin/hadoop fs -copyFromLocal mytar.tar /myapp/mytar.tar
+ $ bin/hadoop fs -copyFromLocal mytgz.tgz /myapp/mytgz.tgz
+ $ bin/hadoop fs -copyFromLocal mytargz.tar.gz /myapp/mytargz.tar.gz
+
+ 2. Setup the application's JobConf:
+
+ JobConf job = new JobConf();
+ DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"),
+ job);
+ DistributedCache.addCacheArchive(new URI("/myapp/map.zip", job);
+ DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
+ DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar", job);
+ DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz", job);
+ DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz", job);
+
+ 3. Use the cached files in the {@link Mapper} or {@link Reducer}:
+
+ public static class MapClass extends MapReduceBase
+ implements Mapper<K, V, K, V> {
+
+ private Path[] localArchives;
+ private Path[] localFiles;
+
+ public void configure(JobConf job) {
+ // Get the cached archives/files
+ localArchives = DistributedCache.getLocalCacheArchives(job);
+ localFiles = DistributedCache.getLocalCacheFiles(job);
+ }
+
+ public void map(K key, V value,
+ OutputCollector<K, V> output, Reporter reporter)
+ throws IOException {
+ // Use data from the cached archives/files here
+ // ...
+ // ...
+ output.collect(k, v);
+ }
+ }
+
+
+ A filename pattern is composed of regular characters and
+ special pattern matching characters, which are:
+
+
+
+
+
+
?
+
Matches any single character.
+
+
+
*
+
Matches zero or more characters.
+
+
+
[abc]
+
Matches a single character from character set
+ {a,b,c}.
+
+
+
[a-b]
+
Matches a single character from the character range
+ {a...b}. Note that character a must be
+ lexicographically less than or equal to character b.
+
+
+
[^a]
+
Matches a single character that is not from character set or range
+ {a}. Note that the ^ character must occur
+ immediately to the right of the opening bracket.
+
+
+
\c
+
Removes (escapes) any special meaning of character c.
+
+
+
{ab,cd}
+
Matches a string from the string set {ab, cd}
+
+
+
{ab,c{de,fh}}
+
Matches a string from the string set {ab, cde, cfh}
+
+
+
+
+
+ @param pathPattern a regular expression specifying a pth pattern
+
+ @return an array of paths that match the path pattern
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ All user code that may potentially use the Hadoop Distributed
+ File System should be written to use a FileSystem object. The
+ Hadoop DFS is a multi-machine system that appears as a single
+ disk. It's useful because of its fault tolerance and potentially
+ very large capacity.
+
+
+ The local implementation is {@link LocalFileSystem} and distributed
+ implementation is {@link DistributedFileSystem}.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FilterFileSystem contains
+ some other file system, which it uses as
+ its basic file system, possibly transforming
+ the data along the way or providing additional
+ functionality. The class FilterFileSystem
+ itself simply overrides all methods of
+ FileSystem with versions that
+ pass all requests to the contained file
+ system. Subclasses of FilterFileSystem
+ may further override some of these methods
+ and may also provide additional methods
+ and fields.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ buf at offset
+ and checksum into checksum.
+ The method is used for implementing read, therefore, it should be optimized
+ for sequential reading
+ @param pos chunkPos
+ @param buf desitination buffer
+ @param offset offset in buf at which to store data
+ @param len maximun number of bytes to read
+ @return number of bytes read]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ -1 if the end of the
+ stream is reached.
+ @exception IOException if an I/O error occurs.]]>
+
+
+
+
+
+
+
+
+ This method implements the general contract of the corresponding
+ {@link InputStream#read(byte[], int, int) read} method of
+ the {@link InputStream} class. As an additional
+ convenience, it attempts to read as many bytes as possible by repeatedly
+ invoking the read method of the underlying stream. This
+ iterated read continues until one of the following
+ conditions becomes true:
+
+
The specified number of bytes have been read,
+
+
The read method of the underlying stream returns
+ -1, indicating end-of-file.
+
+
If the first read on the underlying stream returns
+ -1 to indicate end-of-file then this method returns
+ -1. Otherwise this method returns the number of bytes
+ actually read.
+
+ @param b destination buffer.
+ @param off offset at which to start storing bytes.
+ @param len maximum number of bytes to read.
+ @return the number of bytes read, or -1 if the end of
+ the stream has been reached.
+ @exception IOException if an I/O error occurs.
+ ChecksumException if any checksum error occurs]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ n bytes of data from the
+ input stream.
+
+
This method may skip more bytes than are remaining in the backing
+ file. This produces no exception and the number of bytes skipped
+ may include some number of bytes that were beyond the EOF of the
+ backing file. Attempting to read from the stream after skipping past
+ the end will result in -1 indicating the end of the file.
+
+
If n is negative, no bytes are skipped.
+
+ @param n the number of bytes to be skipped.
+ @return the actual number of bytes skipped.
+ @exception IOException if an I/O error occurs.
+ ChecksumException if the chunk to skip to is corrupted]]>
+
+
+
+
+
+
+ This method may seek past the end of the file.
+ This produces no exception and an attempt to read from
+ the stream will result in -1 indicating the end of the file.
+
+ @param pos the postion to seek to.
+ @exception IOException if an I/O error occurs.
+ ChecksumException if the chunk to seek to is corrupted]]>
+
+
+
+
+
+
+
+
+
+ len bytes from
+ stm
+
+ @param stm an input stream
+ @param buf destiniation buffer
+ @param offset offset at which to store data
+ @param len number of bytes to read
+ @return actual number of bytes read
+ @throws IOException if there is any IO error]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ len bytes from the specified byte array
+ starting at offset off and generate a checksum for
+ each data chunk.
+
+
This method stores bytes from the given array into this
+ stream's buffer before it gets checksumed. The buffer gets checksumed
+ and flushed to the underlying output stream when all data
+ in a checksum chunk are in the buffer. If the buffer is empty and
+ requested length is at least as large as the size of next checksum chunk
+ size, this method will checksum and write the chunk directly
+ to the underlying output stream. Thus it avoids uneccessary data copy.
+
+ @param b the data.
+ @param off the start offset in the data.
+ @param len the number of bytes to write.
+ @exception IOException if an I/O error occurs.]]>
+
+
+ DataInputBuffer buffer = new DataInputBuffer();
+ while (... loop condition ...) {
+ byte[] data = ... get data ...;
+ int dataLength = ... get data length ...;
+ buffer.reset(data, dataLength);
+ ... read buffer using DataInput methods ...
+ }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This saves memory over creating a new DataOutputStream and
+ ByteArrayOutputStream each time data is written.
+
+
Typical usage is something like the following:
+
+ DataOutputBuffer buffer = new DataOutputBuffer();
+ while (... loop condition ...) {
+ buffer.reset();
+ ... write buffer using DataOutput methods ...
+ byte[] data = buffer.getData();
+ int dataLength = buffer.getLength();
+ ... write data to its ultimate destination ...
+ }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to store
+ @param item the object to be stored
+ @param keyName the name of the key to use
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param keyName the name of the key to use
+ @param itemClass the class of the item
+ @return restored object
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param items the objects to be stored
+ @param keyName the name of the key to use
+ @throws IndexOutOfBoundsException if the items array is empty
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param keyName the name of the key to use
+ @param itemClass the class of the item
+ @return restored object
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+ DefaultStringifier offers convenience methods to store/load objects to/from
+ the configuration.
+
+ @param the class of the objects to stringify]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a DoubleWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a FloatWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ When two sequence files, which have same Key type but different Value
+ types, are mapped out to reduce, multiple Value types is not allowed.
+ In this case, this class can help you wrap instances with different types.
+
+
+
+ Compared with ObjectWritable, this class is much more effective,
+ because ObjectWritable will append the class declaration as a String
+ into the output file in every Key-Value pair.
+
+
+
+ Generic Writable implements {@link Configurable} interface, so that it will be
+ configured by the framework. The configuration is passed to the wrapped objects
+ implementing {@link Configurable} interface before deserialization.
+
+
+ how to use it:
+ 1. Write your own class, such as GenericObject, which extends GenericWritable.
+ 2. Implements the abstract method getTypes(), defines
+ the classes which will be wrapped in GenericObject in application.
+ Attention: this classes defined in getTypes() method, must
+ implement Writable interface.
+
+
+ @since Nov 8, 2006]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This saves memory over creating a new InputStream and
+ ByteArrayInputStream each time data is read.
+
+
Typical usage is something like the following:
+
+ InputBuffer buffer = new InputBuffer();
+ while (... loop condition ...) {
+ byte[] data = ... get data ...;
+ int dataLength = ... get data length ...;
+ buffer.reset(data, dataLength);
+ ... read buffer using InputStream methods ...
+ }
+
+ @see DataInputBuffer
+ @see DataOutput]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a IntWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ closes the input and output streams
+ at the end.
+ @param in InputStrem to read from
+ @param out OutputStream to write to
+ @param conf the Configuration object]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ignore any {@link IOException} or
+ null pointers. Must only be used for cleanup in exception handlers.
+ @param log the log to record problems to at debug level. Can be null.
+ @param closeables the objects to close]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a LongWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A map is a directory containing two files, the data file,
+ containing all keys and values in the map, and a smaller index
+ file, containing a fraction of the keys. The fraction is determined by
+ {@link Writer#getIndexInterval()}.
+
+
The index file is read entirely into memory. Thus key implementations
+ should try to keep themselves small.
+
+
Map files are created by adding entries in-order. To maintain a large
+ database, perform updates by copying the previous version of a database and
+ merging in a sorted change list, to create a new version of the database in
+ a new file. Sorting large change lists can be done with {@link
+ SequenceFile.Sorter}.]]>
+
SequenceFile provides {@link Writer}, {@link Reader} and
+ {@link Sorter} classes for writing, reading and sorting respectively.
+
+ There are three SequenceFileWriters based on the
+ {@link CompressionType} used to compress key/value pairs:
+
+
+ Writer : Uncompressed records.
+
+
+ RecordCompressWriter : Record-compressed files, only compress
+ values.
+
+
+ BlockCompressWriter : Block-compressed files, both keys &
+ values are collected in 'blocks'
+ separately and compressed. The size of
+ the 'block' is configurable.
+
+
+
The actual compression algorithm used to compress key and/or values can be
+ specified by using the appropriate {@link CompressionCodec}.
+
+
The recommended way is to use the static createWriter methods
+ provided by the SequenceFile to chose the preferred format.
+
+
The {@link Reader} acts as the bridge and can read any of the above
+ SequenceFile formats.
+
+
SequenceFile Formats
+
+
Essentially there are 3 different formats for SequenceFiles
+ depending on the CompressionType specified. All of them share a
+ common header described below.
+
+
SequenceFile Header
+
+
+ version - 3 bytes of magic header SEQ, followed by 1 byte of actual
+ version number (e.g. SEQ4 or SEQ6)
+
+
+ keyClassName -key class
+
+
+ valueClassName - value class
+
+
+ compression - A boolean which specifies if compression is turned on for
+ keys/values in this file.
+
+
+ blockCompression - A boolean which specifies if block-compression is
+ turned on for keys/values in this file.
+
+
+ compression codec - CompressionCodec class which is used for
+ compression of keys and/or values (if compression is
+ enabled).
+
+
+ metadata - {@link Metadata} for this file.
+
+
+ sync - A sync marker to denote end of the header.
+
The compressed blocks of key lengths and value lengths consist of the
+ actual lengths of individual keys/values encoded in ZeroCompressedInteger
+ format.
+
+ @see CompressionCodec]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ key, skipping its
+ value. True if another entry exists, and false at end of file.]]>
+
+
+
+
+
+
+
+ key and
+ val. Returns true if such a pair exists and false when at
+ end of file]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The position passed must be a position returned by {@link
+ SequenceFile.Writer#getLength()} when writing this file. To seek to an arbitrary
+ position, use {@link SequenceFile.Reader#sync(long)}.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ SegmentDescriptor
+ @param segments the list of SegmentDescriptors
+ @param tmpDir the directory to write temporary files into
+ @return RawKeyValueIterator
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ For best performance, applications should make sure that the {@link
+ Writable#readFields(DataInput)} implementation of their keys is
+ very efficient. In particular, it should avoid allocating memory.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This always returns a synchronized position. In other words,
+ immediately after calling {@link SequenceFile.Reader#seek(long)} with a position
+ returned by this method, {@link SequenceFile.Reader#next(Writable)} may be called. However
+ the key may be earlier in the file than key last written when this
+ method was called (e.g., with block-compression, it may be the first key
+ in the block that was being written when this method was called).]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ key. Returns
+ true if such a key exists and false when at the end of the set.]]>
+
+
+
+
+
+
+ key.
+ Returns key, or null if no match exists.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the class of the objects to stringify]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ position. Note that this
+ method avoids using the converter or doing String instatiation
+ @return the Unicode scalar value at position or -1
+ if the position is invalid or points to a
+ trailing byte]]>
+
+
+
+
+
+
+
+
+
+ what in the backing
+ buffer, starting as position start. The starting
+ position is measured in bytes and the return value is in
+ terms of byte position in the buffer. The backing buffer is
+ not converted to a string for this operation.
+ @return byte position of the first occurence of the search
+ string in the UTF-8 buffer or -1 if not found]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a Text with the same contents.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ replace is true, then
+ malformed input is replaced with the
+ substitution character, which is U+FFFD. Otherwise the
+ method throws a MalformedInputException.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ replace is true, then
+ malformed input is replaced with the
+ substitution character, which is U+FFFD. Otherwise the
+ method throws a MalformedInputException.
+ @return ByteBuffer: bytes stores at ByteBuffer.array()
+ and length is ByteBuffer.limit()]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ In
+ addition, it provides methods for string traversal without converting the
+ byte array to a string.
Also includes utilities for
+ serializing/deserialing a string, coding/decoding a string, checking if a
+ byte array contains valid UTF8 code, calculating the length of an encoded
+ string.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a UTF8 with the same contents.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Also includes utilities for efficiently reading and writing UTF-8.
+
+ @deprecated replaced by Text]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This is useful when a class may evolve, so that instances written by the
+ old version of the class may still be processed by the new version. To
+ handle this situation, {@link #readFields(DataInput)}
+ implementations should catch {@link VersionMismatchException}.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a VIntWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a VLongWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ out.
+
+ @param out DataOuput to serialize this object into.
+ @throws IOException]]>
+
+
+
+
+
+
+ in.
+
+
For efficiency, implementations should attempt to re-use storage in the
+ existing object where possible.
+
+ @param in DataInput to deseriablize this object from.
+ @throws IOException]]>
+
+
+
+ Any key or value type in the Hadoop Map-Reduce
+ framework implements this interface.
+
+
Implementations typically implement a static read(DataInput)
+ method which constructs a new instance, calls {@link #readFields(DataInput)}
+ and returns the instance.
+
+
Example:
+
+ public class MyWritable implements Writable {
+ // Some data
+ private int counter;
+ private long timestamp;
+
+ public void write(DataOutput out) throws IOException {
+ out.writeInt(counter);
+ out.writeLong(timestamp);
+ }
+
+ public void readFields(DataInput in) throws IOException {
+ counter = in.readInt();
+ timestamp = in.readLong();
+ }
+
+ public static MyWritable read(DataInput in) throws IOException {
+ MyWritable w = new MyWritable();
+ w.readFields(in);
+ return w;
+ }
+ }
+
]]>
+
+
+
+
+
+
+
+
+ WritableComparables can be compared to each other, typically
+ via Comparators. Any type which is to be used as a
+ key in the Hadoop Map-Reduce framework should implement this
+ interface.
+
+
Example:
+
+ public class MyWritableComparable implements WritableComparable {
+ // Some data
+ private int counter;
+ private long timestamp;
+
+ public void write(DataOutput out) throws IOException {
+ out.writeInt(counter);
+ out.writeLong(timestamp);
+ }
+
+ public void readFields(DataInput in) throws IOException {
+ counter = in.readInt();
+ timestamp = in.readLong();
+ }
+
+ public int compareTo(MyWritableComparable w) {
+ int thisValue = this.value;
+ int thatValue = ((IntWritable)o).value;
+ return (thisValue < thatValue ? -1 : (thisValue==thatValue ? 0 : 1));
+ }
+ }
+
One may optimize compare-intensive operations by overriding
+ {@link #compare(byte[],int,int,byte[],int,int)}. Static utility methods are
+ provided to assist in optimized implementations of this method.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Enum type
+ @param in DataInput to read from
+ @param enumType Class type of Enum
+ @return Enum represented by String read from DataInput
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ len number of bytes in input streamin
+ @param in input stream
+ @param len number of bytes to skip
+ @throws IOException when skipped less number of bytes]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ CompressionCodec for which to get the
+ Compressor
+ @return Compressor for the given
+ CompressionCodec from the pool or a new one]]>
+
+
+
+
+
+ CompressionCodec for which to get the
+ Decompressor
+ @return Decompressor for the given
+ CompressionCodec the pool or a new one]]>
+
+
+
+
+
+ Compressor to be returned to the pool]]>
+
+
+
+
+
+ Decompressor to be returned to the
+ pool]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Implementations are assumed to be buffered. This permits clients to
+ reposition the underlying input stream then call {@link #resetState()},
+ without having to also synchronize client buffers.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true indicating that more input data is required.
+
+ @param b Input data
+ @param off Start offset
+ @param len Length]]>
+
+
+
+
+ true if the input data buffer is empty and
+ #setInput() should be called in order to provide more input.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the end of the compressed
+ data output stream has been reached.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true indicating that more input data is required.
+
+ @param b Input data
+ @param off Start offset
+ @param len Length]]>
+
+
+
+
+ true if the input data buffer is empty and
+ #setInput() should be called in order to provide more input.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ true if a preset dictionary is needed for decompression.
+ @return true if a preset dictionary is needed for decompression]]>
+
+
+
+
+ true if the end of the compressed
+ data output stream has been reached.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if native-lzo library is loaded & initialized;
+ else false]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ lzo compression/decompression pair.
+ http://www.oberhumer.com/opensource/lzo/]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if lzo compressors are loaded & initialized,
+ else false]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if lzo decompressors are loaded & initialized,
+ else false]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ @return the total (non-negative) number of uncompressed bytes input so far]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ @return the total (non-negative) number of uncompressed bytes input so far]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if native-zlib is loaded & initialized
+ and can be loaded for this job, else false]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Keep trying a limited number of times, waiting a fixed time between attempts,
+ and then fail by re-throwing the exception.
+ ]]>
+
+
+
+
+
+
+
+
+ Keep trying for a maximum time, waiting a fixed time between attempts,
+ and then fail by re-throwing the exception.
+ ]]>
+
+
+
+
+
+
+
+
+ Keep trying a limited number of times, waiting a growing amount of time between attempts,
+ and then fail by re-throwing the exception.
+ The time between attempts is sleepTime mutliplied by the number of tries so far.
+ ]]>
+
+
+
+
+
+
+
+
+ Keep trying a limited number of times, waiting a growing amount of time between attempts,
+ and then fail by re-throwing the exception.
+ The time between attempts is sleepTime mutliplied by a random
+ number in the range of [0, 2 to the number of retries)
+ ]]>
+
+
+
+
+
+
+
+ Set a default policy with some explicit handlers for specific exceptions.
+ ]]>
+
+
+
+
+
+
+
+ A retry policy for RemoteException
+ Set a default policy with some explicit handlers for specific exceptions.
+ ]]>
+
+
+
+
+
+ Try once, and fail by re-throwing the exception.
+ This corresponds to having no retry mechanism in place.
+ ]]>
+
+
+
+
+
+ Try once, and fail silently for void methods, or by
+ re-throwing the exception for non-void methods.
+ ]]>
+
+
+
+
+
+ Keep trying forever.
+ ]]>
+
+
+
+
+ A collection of useful implementations of {@link RetryPolicy}.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+ Determines whether the framework should retry a
+ method for the given exception, and the number
+ of retries that have been made for that operation
+ so far.
+
+ @param e The exception that caused the method to fail.
+ @param retries The number of times the method has been retried.
+ @return true if the method should be retried,
+ false if the method should not be retried
+ but shouldn't fail with an exception (only for void methods).
+ @throws Exception The re-thrown exception e indicating
+ that the method failed and should not be retried further.]]>
+
+
+
+
+ Specifies a policy for retrying method failures.
+ Implementations of this interface should be immutable.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Create a proxy for an interface of an implementation class
+ using the same retry policy for each method in the interface.
+
+ @param iface the interface that the retry will implement
+ @param implementation the instance whose methods should be retried
+ @param retryPolicy the policy for retirying method call failures
+ @return the retry proxy]]>
+
+
+
+
+
+
+
+
+ Create a proxy for an interface of an implementation class
+ using the a set of retry policies specified by method name.
+ If no retry policy is defined for a method then a default of
+ {@link RetryPolicies#TRY_ONCE_THEN_FAIL} is used.
+
+ @param iface the interface that the retry will implement
+ @param implementation the instance whose methods should be retried
+ @param methodNameToPolicyMap a map of method names to retry policies
+ @return the retry proxy]]>
+
+
+
+
+ A factory for creating retry proxies.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+ Prepare the deserializer for reading.]]>
+
+
+
+
+
+
+
+ Deserialize the next object from the underlying input stream.
+ If the object t is non-null then this deserializer
+ may set its internal state to the next object read from the input
+ stream. Otherwise, if the object t is null a new
+ deserialized object will be created.
+
+ @return the deserialized object]]>
+
+
+
+
+
+ Close the underlying input stream and clear up any resources.]]>
+
+
+
+
+ Provides a facility for deserializing objects of type from an
+ {@link InputStream}.
+
+
+
+ Deserializers are stateful, but must not buffer the input since
+ other producers may read from the input between calls to
+ {@link #deserialize(Object)}.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A {@link RawComparator} that uses a {@link Deserializer} to deserialize
+ the objects to be compared so that the standard {@link Comparator} can
+ be used to compare them.
+
+
+ One may optimize compare-intensive operations by using a custom
+ implementation of {@link RawComparator} that operates directly
+ on byte representations.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ An experimental {@link Serialization} for Java {@link Serializable} classes.
+
+ @see JavaSerializationComparator]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A {@link RawComparator} that uses a {@link JavaSerialization}
+ {@link Deserializer} to deserialize objects that are then compared via
+ their {@link Comparable} interfaces.
+
+ @param
+ @see JavaSerialization]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Encapsulates a {@link Serializer}/{@link Deserializer} pair.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+ Serializations are found by reading the io.serializations
+ property from conf, which is a comma-delimited list of
+ classnames.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A factory for {@link Serialization}s.
+ ]]>
+
+
+
+
+
+
+
+
+
+ Prepare the serializer for writing.]]>
+
+
+
+
+
+
+ Serialize t to the underlying output stream.]]>
+
+
+
+
+
+ Close the underlying output stream and clear up any resources.]]>
+
+
+
+
+ Provides a facility for serializing objects of type to an
+ {@link OutputStream}.
+
+
+
+ Serializers are stateful, but must not buffer the output since
+ other producers may write to the output between calls to
+ {@link #serialize(Object)}.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ param, to the IPC server running at
+ address, returning the value. Throws exceptions if there are
+ network problems or if the remote code threw an exception.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Unwraps any IOException.
+
+ @param lookupTypes the desired exception class.
+ @return IOException, which is either the lookupClass exception or this.]]>
+
+
+
+
+ This unwraps any Throwable that has a constructor taking
+ a String as a parameter.
+ Otherwise it returns this.
+
+ @return Throwable]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ protocol is a Java interface. All parameters and return types must
+ be one of:
+
+
a primitive type, boolean, byte,
+ char, short, int, long,
+ float, double, or void; or
+
+
a {@link String}; or
+
+
a {@link Writable}; or
+
+
an array of the above types
+
+ All methods in the protocol should throw only IOException. No field data of
+ the protocol instance is transmitted.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ handlerCount determines
+ the number of handler threads that will be used to process calls.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class has a number of metrics variables that are publicly accessible;
+ these variables (objects) have methods to update their values;
+ for example:
+
{@link #rpcQueueTime}.inc(time)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ For the statistics that are sampled and averaged, one must specify
+ a metrics context that does periodic update calls. Most do.
+ The default Null metrics context however does NOT. So if you aren't
+ using any other metrics context then you can turn on the viewing and averaging
+ of sampled metrics by specifying the following two lines
+ in the hadoop-meterics.properties file:
+
+ Note that the metrics are collected regardless of the context used.
+ The context with the update thread is used to average the data periodically]]>
+
Grouphandles localization of the class name and the
+ counter names.
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FileInputFormat implementations can override this and return
+ false to ensure that individual input files are never split-up
+ so that {@link Mapper}s process entire files.
+
+ @param fs the file system that the file is on
+ @param filename the file name to check
+ @return is this file splitable?]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FileInputFormat is the base class for all file-based
+ InputFormats. This provides a generic implementation of
+ {@link #getSplits(JobConf, int)}.
+ Subclasses of FileInputFormat can also override the
+ {@link #isSplitable(FileSystem, Path)} method to ensure input-files are
+ not split-up and are processed as a whole by {@link Mapper}s.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the job output should be compressed,
+ false otherwise]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Tasks' Side-Effect Files
+
+
Some applications need to create/write-to side-files, which differ from
+ the actual job-outputs.
+
+
In such cases there could be issues with 2 instances of the same TIP
+ (running simultaneously e.g. speculative tasks) trying to open/write-to the
+ same file (path) on HDFS. Hence the application-writer will have to pick
+ unique names per task-attempt (e.g. using the attemptid, say
+ attempt_200709221812_0001_m_000000_0), not just per TIP.
+
+
To get around this the Map-Reduce framework helps the application-writer
+ out by maintaining a special
+ ${mapred.output.dir}/_temporary/_${taskid}
+ sub-directory for each task-attempt on HDFS where the output of the
+ task-attempt goes. On successful completion of the task-attempt the files
+ in the ${mapred.output.dir}/_temporary/_${taskid} (only)
+ are promoted to ${mapred.output.dir}. Of course, the
+ framework discards the sub-directory of unsuccessful task-attempts. This
+ is completely transparent to the application.
+
+
The application-writer can take advantage of this by creating any
+ side-files required in ${mapred.work.output.dir} during execution
+ of his reduce-task i.e. via {@link #getWorkOutputPath(JobConf)}, and the
+ framework will move them out similarly - thus she doesn't have to pick
+ unique paths per task-attempt.
+
+
Note: the value of ${mapred.work.output.dir} during
+ execution of a particular task-attempt is actually
+ ${mapred.output.dir}/_temporary/_{$taskid}, and this value is
+ set by the map-reduce framework. So, just create any side-files in the
+ path returned by {@link #getWorkOutputPath(JobConf)} from map/reduce
+ task to take advantage of this feature.
+
+
The entire discussion holds true for maps of jobs with
+ reducer=NONE (i.e. 0 reduces) since output of the map, in that case,
+ goes directly to HDFS.
+
+ @return the {@link Path} to the task's temporary output directory
+ for the map-reduce job.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This method is used to validate the input directories when a job is
+ submitted so that the {@link JobClient} can fail early, with an useful
+ error message, in case of errors. For e.g. input directory does not exist.
+
+
+ @param job job configuration.
+ @throws InvalidInputException if the job does not have valid input
+ @deprecated getSplits is called in the client and can perform any
+ necessary validation of the input]]>
+
+
+
+
+
+
+
+ Each {@link InputSplit} is then assigned to an individual {@link Mapper}
+ for processing.
+
+
Note: The split is a logical split of the inputs and the
+ input files are not physically split into chunks. For e.g. a split could
+ be <input-file-path, start, offset> tuple.
+
+ @param job job configuration.
+ @param numSplits the desired number of splits, a hint.
+ @return an array of {@link InputSplit}s for the job.]]>
+
+
+
+
+
+
+
+
+ It is the responsibility of the RecordReader to respect
+ record boundaries while processing the logical split to present a
+ record-oriented view to the individual task.
+
+ @param split the {@link InputSplit}
+ @param job the job that this split belongs to
+ @return a {@link RecordReader}]]>
+
+
+
+ InputFormat describes the input-specification for a
+ Map-Reduce job.
+
+
The Map-Reduce framework relies on the InputFormat of the
+ job to:
+
+
+ Validate the input-specification of the job.
+
+ Split-up the input file(s) into logical {@link InputSplit}s, each of
+ which is then assigned to an individual {@link Mapper}.
+
+
+ Provide the {@link RecordReader} implementation to be used to glean
+ input records from the logical InputSplit for processing by
+ the {@link Mapper}.
+
+
+
+
The default behavior of file-based {@link InputFormat}s, typically
+ sub-classes of {@link FileInputFormat}, is to split the
+ input into logical {@link InputSplit}s based on the total size, in
+ bytes, of the input files. However, the {@link FileSystem} blocksize of
+ the input files is treated as an upper bound for input splits. A lower bound
+ on the split size can be set via
+
+ mapred.min.split.size.
+
+
Clearly, logical splits based on input-size is insufficient for many
+ applications since record boundaries are to respected. In such cases, the
+ application has to also implement a {@link RecordReader} on whom lies the
+ responsibilty to respect record-boundaries and present a record-oriented
+ view of the logical InputSplit to the individual task.
+
+ @see InputSplit
+ @see RecordReader
+ @see JobClient
+ @see FileInputFormat]]>
+
+
+
+
+
+
+
+
+
+ InputSplit.
+
+ @return the number of bytes in the input split.
+ @throws IOException]]>
+
+
+
+
+
+ InputSplit is
+ located as an array of Strings.
+ @throws IOException]]>
+
+
+
+ InputSplit represents the data to be processed by an
+ individual {@link Mapper}.
+
+
Typically, it presents a byte-oriented view on the input and is the
+ responsibility of {@link RecordReader} of the job to process this and present
+ a record-oriented view.
+
+ @see InputFormat
+ @see RecordReader]]>
+
+ Checking the input and output specifications of the job.
+
+
+ Computing the {@link InputSplit}s for the job.
+
+
+ Setup the requisite accounting information for the {@link DistributedCache}
+ of the job, if necessary.
+
+
+ Copying the job's jar and configuration to the map-reduce system directory
+ on the distributed file-system.
+
+
+ Submitting the job to the JobTracker and optionally monitoring
+ it's status.
+
+
+
+ Normally the user creates the application, describes various facets of the
+ job via {@link JobConf} and then uses the JobClient to submit
+ the job and monitor its progress.
+
+
Here is an example on how to use JobClient:
+
+ // Create a new JobConf
+ JobConf job = new JobConf(new Configuration(), MyJob.class);
+
+ // Specify various job-specific parameters
+ job.setJobName("myjob");
+
+ job.setInputPath(new Path("in"));
+ job.setOutputPath(new Path("out"));
+
+ job.setMapperClass(MyJob.MyMapper.class);
+ job.setReducerClass(MyJob.MyReducer.class);
+
+ // Submit the job, then poll for progress until the job is complete
+ JobClient.runJob(job);
+
+
+
Job Control
+
+
At times clients would chain map-reduce jobs to accomplish complex tasks
+ which cannot be done via a single map-reduce job. This is fairly easy since
+ the output of the job, typically, goes to distributed file-system and that
+ can be used as the input for the next job.
+
+
However, this also means that the onus on ensuring jobs are complete
+ (success/failure) lies squarely on the clients. In such situations the
+ various job-control options are:
+
+
+ {@link #runJob(JobConf)} : submits the job and returns only after
+ the job has completed.
+
+
+ {@link #submitJob(JobConf)} : only submits the job, then poll the
+ returned handle to the {@link RunningJob} to query status and make
+ scheduling decisions.
+
+
+ {@link JobConf#setJobEndNotificationURI(String)} : setup a notification
+ on job-completion, thus avoiding polling.
+
For key-value pairs (K1,V1) and (K2,V2), the values (V1, V2) are passed
+ in a single call to the reduce function if K1 and K2 compare as equal.
+
+
Since {@link #setOutputKeyComparatorClass(Class)} can be used to control
+ how keys are sorted, this can be used in conjunction to simulate
+ secondary sort on values.
+
+
Note: This is not a guarantee of the reduce sort being
+ stable in any sense. (In any case, with the order of available
+ map-outputs to the reduce being non-deterministic, it wouldn't make
+ that much sense.)
+
+ @param theClass the comparator class to be used for grouping keys.
+ It should implement RawComparator.
+ @see #setOutputKeyComparatorClass(Class)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ combiner class used to combine map-outputs
+ before being sent to the reducers. Typically the combiner is same as the
+ the {@link Reducer} for the job i.e. {@link #getReducerClass()}.
+
+ @return the user-defined combiner class used to combine map-outputs.]]>
+
+
+
+
+
+ combiner class used to combine map-outputs
+ before being sent to the reducers.
+
+
The combiner is a task-level aggregation operation which, in some cases,
+ helps to cut down the amount of data transferred from the {@link Mapper} to
+ the {@link Reducer}, leading to better performance.
+
+
Typically the combiner is same as the Reducer for the
+ job i.e. {@link #setReducerClass(Class)}.
+
+ @param theClass the user-defined combiner class used to combine
+ map-outputs.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ true.
+
+ @return true if speculative execution be used for this job,
+ false otherwise.]]>
+
+
+
+
+
+ true if speculative execution
+ should be turned on, else false.]]>
+
+
+
+
+ true.
+
+ @return true if speculative execution be
+ used for this job for map tasks,
+ false otherwise.]]>
+
+
+
+
+
+ true if speculative execution
+ should be turned on for map tasks,
+ else false.]]>
+
+
+
+
+ true.
+
+ @return true if speculative execution be used
+ for reduce tasks for this job,
+ false otherwise.]]>
+
+
+
+
+
+ true if speculative execution
+ should be turned on for reduce tasks,
+ else false.]]>
+
+
+
+
+ 1.
+
+ @return the number of reduce tasks for this job.]]>
+
+
+
+
+
+ Note: This is only a hint to the framework. The actual
+ number of spawned map tasks depends on the number of {@link InputSplit}s
+ generated by the job's {@link InputFormat#getSplits(JobConf, int)}.
+
+ A custom {@link InputFormat} is typically used to accurately control
+ the number of map tasks for the job.
+
+
How many maps?
+
+
The number of maps is usually driven by the total size of the inputs
+ i.e. total number of blocks of the input files.
+
+
The right level of parallelism for maps seems to be around 10-100 maps
+ per-node, although it has been set up to 300 or so for very cpu-light map
+ tasks. Task setup takes awhile, so it is best if the maps take at least a
+ minute to execute.
+
+
The default behavior of file-based {@link InputFormat}s is to split the
+ input into logical {@link InputSplit}s based on the total size, in
+ bytes, of input files. However, the {@link FileSystem} blocksize of the
+ input files is treated as an upper bound for input splits. A lower bound
+ on the split size can be set via
+
+ mapred.min.split.size.
+
+
Thus, if you expect 10TB of input data and have a blocksize of 128MB,
+ you'll end up with 82,000 maps, unless {@link #setNumMapTasks(int)} is
+ used to set it even higher.
+
+ @param n the number of map tasks for this job.
+ @see InputFormat#getSplits(JobConf, int)
+ @see FileInputFormat
+ @see FileSystem#getDefaultBlockSize()
+ @see FileStatus#getBlockSize()]]>
+
+
+
+
+ 1.
+
+ @return the number of reduce tasks for this job.]]>
+
+
+
+
+
+ How many reduces?
+
+
With 0.95 all of the reduces can launch immediately and
+ start transfering map outputs as the maps finish. With 1.75
+ the faster nodes will finish their first round of reduces and launch a
+ second wave of reduces doing a much better job of load balancing.
+
+
Increasing the number of reduces increases the framework overhead, but
+ increases load balancing and lowers the cost of failures.
+
+
The scaling factors above are slightly less than whole numbers to
+ reserve a few reduce slots in the framework for speculative-tasks, failures
+ etc.
+
+
Reducer NONE
+
+
It is legal to set the number of reduce-tasks to zero.
+
+
In this case the output of the map-tasks directly go to distributed
+ file-system, to the path set by
+ {@link FileOutputFormat#setOutputPath(JobConf, Path)}. Also, the
+ framework doesn't sort the map-outputs before writing it out to HDFS.
+
+ @param n the number of reduce tasks for this job.]]>
+
+
+
+
+ mapred.map.max.attempts
+ property. If this property is not already set, the default is 4 attempts.
+
+ @return the max number of attempts per map task.]]>
+
+
+
+
+
+
+
+
+
+
+ mapred.reduce.max.attempts
+ property. If this property is not already set, the default is 4 attempts.
+
+ @return the max number of attempts per reduce task.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ noFailures, the
+ tasktracker is blacklisted for this job.
+
+ @param noFailures maximum no. of failures of a given job per tasktracker.]]>
+
+
+
+
+ blacklisted for this job.
+
+ @return the maximum no. of failures of a given job per tasktracker.]]>
+
+
+
+
+ failed.
+
+ Defaults to zero, i.e. any failed map-task results in
+ the job being declared as {@link JobStatus#FAILED}.
+
+ @return the maximum percentage of map tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+
+ failed.
+
+ @param percent the maximum percentage of map tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+ failed.
+
+ Defaults to zero, i.e. any failed reduce-task results
+ in the job being declared as {@link JobStatus#FAILED}.
+
+ @return the maximum percentage of reduce tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+
+ failed.
+
+ @param percent the maximum percentage of reduce tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The debug script can aid debugging of failed map tasks. The script is
+ given task's stdout, stderr, syslog, jobconf files as arguments.
+
+
The debug command, run on the node where the map failed, is:
+
+ $script $stdout $stderr $syslog $jobconf.
+
+
+
The script file is distributed through {@link DistributedCache}
+ APIs. The script needs to be symlinked.
+
+ @param mDbgScript the script name]]>
+
+
+
+
+
+
+
+
+
+
+ The debug script can aid debugging of failed reduce tasks. The script
+ is given task's stdout, stderr, syslog, jobconf files as arguments.
+
+
The debug command, run on the node where the map failed, is:
+
+ $script $stdout $stderr $syslog $jobconf.
+
+
+
The script file is distributed through {@link DistributedCache}
+ APIs. The script file needs to be symlinked
+
+ @param rDbgScript the script name]]>
+
+
+
+
+
+
+
+
+
+ null if it hasn't
+ been set.
+ @see #setJobEndNotificationURI(String)]]>
+
+
+
+
+
+ The uri can contain 2 special parameters: $jobId and
+ $jobStatus. Those, if present, are replaced by the job's
+ identifier and completion-status respectively.
+
+
This is typically used by application-writers to implement chaining of
+ Map-Reduce jobs in an asynchronous manner.
+
+ @param uri the job end notification uri
+ @see JobStatus
+ @see Job Completion and Chaining]]>
+
+
+
+
+
+ When a job starts, a shared directory is created at location
+
+ ${mapred.local.dir}/taskTracker/jobcache/$jobid/work/ .
+ This directory is exposed to the users through
+ job.local.dir .
+ So, the tasks can use this space
+ as scratch space and share files among them.
+ This value is available as System property also.
+
+ @return The localized job specific shared directory]]>
+
+
+
+ JobConf is the primary interface for a user to describe a
+ map-reduce job to the Hadoop framework for execution. The framework tries to
+ faithfully execute the job as-is described by JobConf, however:
+
+
+ Some configuration parameters might have been marked as
+
+ final by administrators and hence cannot be altered.
+
+
+ While some job parameters are straight-forward to set
+ (e.g. {@link #setNumReduceTasks(int)}), some parameters interact subtly
+ rest of the framework and/or job-configuration and is relatively more
+ complex for the user to control finely (e.g. {@link #setNumMapTasks(int)}).
+
+
+
+
JobConf typically specifies the {@link Mapper}, combiner
+ (if any), {@link Partitioner}, {@link Reducer}, {@link InputFormat} and
+ {@link OutputFormat} implementations to be used etc.
+
+
Optionally JobConf is used to specify other advanced facets
+ of the job such as Comparators to be used, files to be put in
+ the {@link DistributedCache}, whether or not intermediate and/or job outputs
+ are to be compressed (and how), debugability via user-provided scripts
+ ( {@link #setMapDebugScript(String)}/{@link #setReduceDebugScript(String)}),
+ for doing post-processing on task logs, task's stdout, stderr, syslog.
+ and etc.
+
+
Here is an example on how to configure a job via JobConf:
+
+ // Create a new JobConf
+ JobConf job = new JobConf(new Configuration(), MyJob.class);
+
+ // Specify various job-specific parameters
+ job.setJobName("myjob");
+
+ FileInputFormat.setInputPaths(job, new Path("in"));
+ FileOutputFormat.setOutputPath(job, new Path("out"));
+
+ job.setMapperClass(MyJob.MyMapper.class);
+ job.setCombinerClass(MyJob.MyReducer.class);
+ job.setReducerClass(MyJob.MyReducer.class);
+
+ job.setInputFormat(SequenceFileInputFormat.class);
+ job.setOutputFormat(SequenceFileOutputFormat.class);
+
+ @param jtIdentifier jobTracker identifier, or null
+ @param jobId job number, or null
+ @return a regex pattern matching JobIDs]]>
+
+
+
+
+ An example JobID is :
+ job_200707121733_0003 , which represents the third job
+ running at the jobtracker started at 200707121733.
+
+ Applications should never construct or parse JobID strings, but rather
+ use appropriate constructors or {@link #forName(String)} method.
+
+ @see TaskID
+ @see TaskAttemptID
+ @see JobTracker#getNewJobId()
+ @see JobTracker#getStartTime()]]>
+
Applications can use the {@link Reporter} provided to report progress
+ or just indicate that they are alive. In scenarios where the application
+ takes an insignificant amount of time to process individual key/value
+ pairs, this is crucial since the framework might assume that the task has
+ timed-out and kill that task. The other way of avoiding this is to set
+
+ mapred.task.timeout to a high-enough value (or even zero for no
+ time-outs).
+
+ @param key the input key.
+ @param value the input value.
+ @param output collects mapped keys and values.
+ @param reporter facility to report progress.]]>
+
+
+
+ Maps are the individual tasks which transform input records into a
+ intermediate records. The transformed intermediate records need not be of
+ the same type as the input records. A given input pair may map to zero or
+ many output pairs.
+
+
The Hadoop Map-Reduce framework spawns one map task for each
+ {@link InputSplit} generated by the {@link InputFormat} for the job.
+ Mapper implementations can access the {@link JobConf} for the
+ job via the {@link JobConfigurable#configure(JobConf)} and initialize
+ themselves. Similarly they can use the {@link Closeable#close()} method for
+ de-initialization.
+
+
The framework then calls
+ {@link #map(Object, Object, OutputCollector, Reporter)}
+ for each key/value pair in the InputSplit for that task.
+
+
All intermediate values associated with a given output key are
+ subsequently grouped by the framework, and passed to a {@link Reducer} to
+ determine the final output. Users can control the grouping by specifying
+ a Comparator via
+ {@link JobConf#setOutputKeyComparatorClass(Class)}.
+
+
The grouped Mapper outputs are partitioned per
+ Reducer. Users can control which keys (and hence records) go to
+ which Reducer by implementing a custom {@link Partitioner}.
+
+
Users can optionally specify a combiner, via
+ {@link JobConf#setCombinerClass(Class)}, to perform local aggregation of the
+ intermediate outputs, which helps to cut down the amount of data transferred
+ from the Mapper to the Reducer.
+
+
The intermediate, grouped outputs are always stored in
+ {@link SequenceFile}s. Applications can specify if and how the intermediate
+ outputs are to be compressed and which {@link CompressionCodec}s are to be
+ used via the JobConf.
+
+
If the job has
+ zero
+ reduces then the output of the Mapper is directly written
+ to the {@link FileSystem} without grouping by keys.
+
+
Example:
+
+ public class MyMapper<K extends WritableComparable, V extends Writable>
+ extends MapReduceBase implements Mapper<K, V, K, V> {
+
+ static enum MyCounters { NUM_RECORDS }
+
+ private String mapTaskId;
+ private String inputFile;
+ private int noRecords = 0;
+
+ public void configure(JobConf job) {
+ mapTaskId = job.get("mapred.task.id");
+ inputFile = job.get("mapred.input.file");
+ }
+
+ public void map(K key, V val,
+ OutputCollector<K, V> output, Reporter reporter)
+ throws IOException {
+ // Process the <key, value> pair (assume this takes a while)
+ // ...
+ // ...
+
+ // Let the framework know that we are alive, and kicking!
+ // reporter.progress();
+
+ // Process some more
+ // ...
+ // ...
+
+ // Increment the no. of <key, value> pairs processed
+ ++noRecords;
+
+ // Increment counters
+ reporter.incrCounter(NUM_RECORDS, 1);
+
+ // Every 100 records update application-level status
+ if ((noRecords%100) == 0) {
+ reporter.setStatus(mapTaskId + " processed " + noRecords +
+ " from input-file: " + inputFile);
+ }
+
+ // Output the result
+ output.collect(key, val);
+ }
+ }
+
+
+
Applications may write a custom {@link MapRunnable} to exert greater
+ control on map processing e.g. multi-threaded Mappers etc.
Mapping of input records to output records is complete when this method
+ returns.
+
+ @param input the {@link RecordReader} to read the input records.
+ @param output the {@link OutputCollector} to collect the outputrecords.
+ @param reporter {@link Reporter} to report progress, status-updates etc.
+ @throws IOException]]>
+
+
+
+ Custom implementations of MapRunnable can exert greater
+ control on map processing e.g. multi-threaded, asynchronous mappers etc.
+
+ @see Mapper]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ nearly
+ equal content length.
+ Subclasses implement {@link #getRecordReader(InputSplit, JobConf, Reporter)}
+ to construct RecordReader's for MultiFileSplit's.
+ @see MultiFileSplit]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ th Path]]>
+
+
+
+
+
+
+
+
+
+
+ th Path]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ MultiFileSplit can be used to implement {@link RecordReader}'s, with
+ reading one record per file.
+ @see FileSplit
+ @see MultiFileInputFormat]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ <key, value> pairs output by {@link Mapper}s
+ and {@link Reducer}s.
+
+
OutputCollector is the generalization of the facility
+ provided by the Map-Reduce framework to collect data output by either the
+ Mapper or the Reducer i.e. intermediate outputs
+ or the output of the job.
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This is to validate the output specification for the job when it is
+ a job is submitted. Typically checks that it does not already exist,
+ throwing an exception when it already exists, so that output is not
+ overwritten.
+
+ @param ignored
+ @param job job configuration.
+ @throws IOException when output should not be attempted]]>
+
+
+
+ OutputFormat describes the output-specification for a
+ Map-Reduce job.
+
+
The Map-Reduce framework relies on the OutputFormat of the
+ job to:
+
+
+ Validate the output-specification of the job. For e.g. check that the
+ output directory doesn't already exist.
+
+ Provide the {@link RecordWriter} implementation to be used to write out
+ the output files of the job. Output files are stored in a
+ {@link FileSystem}.
+
+
+
+ @see RecordWriter
+ @see JobConf]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the job output should be compressed,
+ false otherwise]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Typically a hash function on a all or a subset of the key.
+
+ @param key the key to be paritioned.
+ @param value the entry value.
+ @param numPartitions the total number of partitions.
+ @return the partition number for the key.]]>
+
+
+
+ Partitioner controls the partitioning of the keys of the
+ intermediate map-outputs. The key (or a subset of the key) is used to derive
+ the partition, typically by a hash function. The total number of partitions
+ is the same as the number of reduce tasks for the job. Hence this controls
+ which of the m reduce tasks the intermediate key (and hence the
+ record) is sent for reduction.
+
+ @see Reducer]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ 0.0 to 1.0.
+ @throws IOException]]>
+
+
+
+ RecordReader reads <key, value> pairs from an
+ {@link InputSplit}.
+
+
RecordReader, typically, converts the byte-oriented view of
+ the input, provided by the InputSplit, and presents a
+ record-oriented view for the {@link Mapper} & {@link Reducer} tasks for
+ processing. It thus assumes the responsibility of processing record
+ boundaries and presenting the tasks with keys and values.
RecordWriter implementations write the job outputs to the
+ {@link FileSystem}.
+
+ @see OutputFormat]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Reduces values for a given key.
+
+
The framework calls this method for each
+ <key, (list of values)> pair in the grouped inputs.
+ Output values must be of the same type as input values. Input keys must
+ not be altered. The framework will reuse the key and value objects
+ that are passed into the reduce, therefore the application should clone
+ the objects they want to keep a copy of. In many cases, all values are
+ combined into zero or one value.
+
+
+
Output pairs are collected with calls to
+ {@link OutputCollector#collect(Object,Object)}.
+
+
Applications can use the {@link Reporter} provided to report progress
+ or just indicate that they are alive. In scenarios where the application
+ takes an insignificant amount of time to process individual key/value
+ pairs, this is crucial since the framework might assume that the task has
+ timed-out and kill that task. The other way of avoiding this is to set
+
+ mapred.task.timeout to a high-enough value (or even zero for no
+ time-outs).
+
+ @param key the key.
+ @param values the list of values to reduce.
+ @param output to collect keys and combined values.
+ @param reporter facility to report progress.]]>
+
+
+
+ The number of Reducers for the job is set by the user via
+ {@link JobConf#setNumReduceTasks(int)}. Reducer implementations
+ can access the {@link JobConf} for the job via the
+ {@link JobConfigurable#configure(JobConf)} method and initialize themselves.
+ Similarly they can use the {@link Closeable#close()} method for
+ de-initialization.
+
+
Reducer has 3 primary phases:
+
+
+
+
Shuffle
+
+
Reducer is input the grouped output of a {@link Mapper}.
+ In the phase the framework, for each Reducer, fetches the
+ relevant partition of the output of all the Mappers, via HTTP.
+
+
+
+
+
Sort
+
+
The framework groups Reducer inputs by keys
+ (since different Mappers may have output the same key) in this
+ stage.
+
+
The shuffle and sort phases occur simultaneously i.e. while outputs are
+ being fetched they are merged.
+
+
SecondarySort
+
+
If equivalence rules for keys while grouping the intermediates are
+ different from those for grouping keys before reduction, then one may
+ specify a Comparator via
+ {@link JobConf#setOutputValueGroupingComparator(Class)}.Since
+ {@link JobConf#setOutputKeyComparatorClass(Class)} can be used to
+ control how intermediate keys are grouped, these can be used in conjunction
+ to simulate secondary sort on values.
+
+
+ For example, say that you want to find duplicate web pages and tag them
+ all with the url of the "best" known example. You would set up the job
+ like:
+
+
Map Input Key: url
+
Map Input Value: document
+
Map Output Key: document checksum, url pagerank
+
Map Output Value: url
+
Partitioner: by checksum
+
OutputKeyComparator: by checksum and then decreasing pagerank
+
OutputValueGroupingComparator: by checksum
+
+
+
+
+
Reduce
+
+
In this phase the
+ {@link #reduce(Object, Iterator, OutputCollector, Reporter)}
+ method is called for each <key, (list of values)> pair in
+ the grouped inputs.
+
The output of the reduce task is typically written to the
+ {@link FileSystem} via
+ {@link OutputCollector#collect(Object, Object)}.
+
+
+
+
The output of the Reducer is not re-sorted.
+
+
Example:
+
+ public class MyReducer<K extends WritableComparable, V extends Writable>
+ extends MapReduceBase implements Reducer<K, V, K, V> {
+
+ static enum MyCounters { NUM_RECORDS }
+
+ private String reduceTaskId;
+ private int noKeys = 0;
+
+ public void configure(JobConf job) {
+ reduceTaskId = job.get("mapred.task.id");
+ }
+
+ public void reduce(K key, Iterator<V> values,
+ OutputCollector<K, V> output,
+ Reporter reporter)
+ throws IOException {
+
+ // Process
+ int noValues = 0;
+ while (values.hasNext()) {
+ V value = values.next();
+
+ // Increment the no. of values for this key
+ ++noValues;
+
+ // Process the <key, value> pair (assume this takes a while)
+ // ...
+ // ...
+
+ // Let the framework know that we are alive, and kicking!
+ if ((noValues%10) == 0) {
+ reporter.progress();
+ }
+
+ // Process some more
+ // ...
+ // ...
+
+ // Output the <key, value>
+ output.collect(key, value);
+ }
+
+ // Increment the no. of <key, list of values> pairs processed
+ ++noKeys;
+
+ // Increment counters
+ reporter.incrCounter(NUM_RECORDS, 1);
+
+ // Every 100 keys update application-level status
+ if ((noKeys%100) == 0) {
+ reporter.setStatus(reduceTaskId + " processed " + noKeys);
+ }
+ }
+ }
+
+
+ @see Mapper
+ @see Partitioner
+ @see Reporter
+ @see MapReduceBase]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Enum.
+ @param amount A non-negative amount by which the counter is to
+ be incremented.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ InputSplit that the map is reading from.
+ @throws UnsupportedOperationException if called outside a mapper]]>
+
+
+
+
+
+
+
+
+ {@link Mapper} and {@link Reducer} can use the Reporter
+ provided to report progress or just indicate that they are alive. In
+ scenarios where the application takes an insignificant amount of time to
+ process individual key/value pairs, this is crucial since the framework
+ might assume that the task has timed-out and kill that task.
+
+
Applications can also update {@link Counters} via the provided
+ Reporter .
+
+ @see Progressable
+ @see Counters]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ progress of the job's map-tasks, as a float between 0.0
+ and 1.0. When all map tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's map-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ progress of the job's reduce-tasks, as a float between 0.0
+ and 1.0. When all reduce tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's reduce-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ true if the job is complete, else false.
+ @throws IOException]]>
+
+
+
+
+
+ true if the job succeeded, else false.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ RunningJob is the user-interface to query for details on a
+ running Map-Reduce job.
+
+
Clients can get hold of RunningJob via the {@link JobClient}
+ and then query the running-job for details such as name, configuration,
+ progress etc.
+
+ @see JobClient]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This allows the user to specify the key class to be different
+ from the actual class ({@link BytesWritable}) used for writing
+
+ @param conf the {@link JobConf} to modify
+ @param theClass the SequenceFile output key class.]]>
+
+
+
+
+
+
+ This allows the user to specify the value class to be different
+ from the actual class ({@link BytesWritable}) used for writing
+
+ @param conf the {@link JobConf} to modify
+ @param theClass the SequenceFile output key class.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ f. The filtering criteria is
+ MD5(key) % f == 0.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ f using
+ the criteria record# % f == 0.
+ For example, if the frequency is 10, one out of 10 records is returned.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ .
+ @param name The name of the server
+ @param port The port to use on the server
+ @param findPort whether the server should start at the given port and
+ increment by 1 until it finds a free port.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ points to the log directory
+ "/static/" -> points to common static files (src/webapps/static)
+ "/" -> the jsp server code from (src/webapps/)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ all task attempt IDs
+ of any jobtracker, in any job, of the first
+ map task, we would use :
+
+ @param jtIdentifier jobTracker identifier, or null
+ @param jobId job number, or null
+ @param isMap whether the tip is a map, or null
+ @param taskId taskId number, or null
+ @param attemptId the task attempt number, or null
+ @return a regex pattern matching TaskAttemptIDs]]>
+
+
+
+
+ An example TaskAttemptID is :
+ attempt_200707121733_0003_m_000005_0 , which represents the
+ zeroth task attempt for the fifth map task in the third job
+ running at the jobtracker started at 200707121733.
+
+ Applications should never construct or parse TaskAttemptID strings
+ , but rather use appropriate constructors or {@link #forName(String)}
+ method.
+
+ @see JobID
+ @see TaskID]]>
+
+ @param jtIdentifier jobTracker identifier, or null
+ @param jobId job number, or null
+ @param isMap whether the tip is a map, or null
+ @param taskId taskId number, or null
+ @return a regex pattern matching TaskIDs]]>
+
+
+
+
+ An example TaskID is :
+ task_200707121733_0003_m_000005 , which represents the
+ fifth map task in the third job running at the jobtracker
+ started at 200707121733.
+
+ Applications should never construct or parse TaskID strings
+ , but rather use appropriate constructors or {@link #forName(String)}
+ method.
+
+ @see JobID
+ @see TaskAttemptID]]>
+
+ Map implementations using this MapRunnable must be thread-safe.
+
+ The Map-Reduce job has to be configured to use this MapRunnable class (using
+ the JobConf.setMapRunnerClass method) and
+ the number of thread the thread-pool can use with the
+ mapred.map.multithreadedrunner.threads property, its default
+ value is 10 threads.
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ pairs. Uses
+ {@link StringTokenizer} to break text into tokens.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ generateKeyValPairs(Object key, Object value); public void
+ configure(JobConfjob); }
+
+ The package also provides a base class, ValueAggregatorBaseDescriptor,
+ implementing the above interface. The user can extend the base class and
+ implement generateKeyValPairs accordingly.
+
+ The primary work of generateKeyValPairs is to emit one or more key/value
+ pairs based on the input key/value pair. The key in an output key/value pair
+ encode two pieces of information: aggregation type and aggregation id. The
+ value will be aggregated onto the aggregation id according the aggregation
+ type.
+
+ This class offers a function to generate a map/reduce job using Aggregate
+ framework. The function takes the following parameters: input directory spec
+ input format (text or sequence file) output directory a file specifying the
+ user plugin class]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ When constructing the instance, if the factory property
+ contextName.class exists,
+ its value is taken to be the name of the class to instantiate. Otherwise,
+ the default is to create an instance of
+ org.apache.hadoop.metrics.spi.NullContext, which is a
+ dummy "no-op" context which will cause all metric data to be discarded.
+
+ @param contextName the name of the context
+ @return the named MetricsContext]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ When the instance is constructed, this method checks if the file
+ hadoop-metrics.properties exists on the class path. If it
+ exists, it must be in the format defined by java.util.Properties, and all
+ the properties in the file are set as attributes on the newly created
+ ContextFactory instance.
+
+ @return the singleton ContextFactory instance]]>
+
+
+
+ getFactory() method.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ startMonitoring() again after calling
+ this.
+ @see #close()]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ recordName.
+ Throws an exception if the metrics implementation is configured with a fixed
+ set of record names and recordName is not in that set.
+
+ @param recordName the name of the record
+ @throws MetricsException if recordName conflicts with configuration data]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A record name identifies the kind of data to be reported. For example, a
+ program reporting statistics relating to the disks on a computer might use
+ a record name "diskStats".
+
+ A record has zero or more tags. A tag has a name and a value. To
+ continue the example, the "diskStats" record might use a tag named
+ "diskName" to identify a particular disk. Sometimes it is useful to have
+ more than one tag, so there might also be a "diskType" with value "ide" or
+ "scsi" or whatever.
+
+ A record also has zero or more metrics. These are the named
+ values that are to be reported to the metrics system. In the "diskStats"
+ example, possible metric names would be "diskPercentFull", "diskPercentBusy",
+ "kbReadPerSecond", etc.
+
+ The general procedure for using a MetricsRecord is to fill in its tag and
+ metric values, and then call update() to pass the record to the
+ client library.
+ Metric data is not immediately sent to the metrics system
+ each time that update() is called.
+ An internal table is maintained, identified by the record name. This
+ table has columns
+ corresponding to the tag and the metric names, and rows
+ corresponding to each unique set of tag values. An update
+ either modifies an existing row in the table, or adds a new row with a set of
+ tag values that are different from all the other rows. Note that if there
+ are no tags, then there can be at most one row in the table.
+
+ Once a row is added to the table, its data will be sent to the metrics system
+ on every timer period, whether or not it has been updated since the previous
+ timer period. If this is inappropriate, for example if metrics were being
+ reported by some transient object in an application, the remove()
+ method can be used to remove the row and thus stop the data from being
+ sent.
+
+ Note that the update() method is atomic. This means that it is
+ safe for different threads to be updating the same metric. More precisely,
+ it is OK for different threads to call update() on MetricsRecord instances
+ with the same set of tag names and tag values. Different threads should
+ not use the same MetricsRecord instance at the same time.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ MetricsContext.registerUpdater().]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ fileName attribute,
+ if specified. Otherwise the data will be written to standard
+ output.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class is configured by setting ContextFactory attributes which in turn
+ are usually configured through a properties file. All the attributes are
+ prefixed by the contextName. For example, the properties file might contain:
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ contextName.tableName. The returned map consists of
+ those attributes with the contextName and tableName stripped off.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ recordName.
+ Throws an exception if the metrics implementation is configured with a fixed
+ set of record names and recordName is not in that set.
+
+ @param recordName the name of the record
+ @throws MetricsException if recordName conflicts with configuration data]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class implements the internal table of metric data, and the timer
+ on which data is to be sent to the metrics system. Subclasses must
+ override the abstract emitRecord method in order to transmit
+ the data. ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ update
+ and remove().]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ hostname or hostname:port. If
+ the specs string is null, defaults to localhost:defaultPort.
+
+ @return a list of InetSocketAddress objects.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ,name="
+ Where the and are the supplied parameters
+
+ @param serviceName
+ @param nameName
+ @param theMbean - the MBean to register
+ @return the named used to register the MBean]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ hadoop.rpc.socket.factory.class.<ClassName>. When no
+ such parameter exists then fall back on the default socket factory as
+ configured by hadoop.rpc.socket.factory.class.default. If
+ this default socket factory is not configured, then fall back on the JVM
+ default socket factory.
+
+ @param conf the configuration
+ @param clazz the class (usually a {@link VersionedProtocol})
+ @return a socket factory]]>
+
+
+
+
+
+ hadoop.rpc.socket.factory.default
+
+ @param conf the configuration
+ @return the default socket factory as specified in the configuration or
+ the JVM default socket factory if the configuration does not
+ contain a default socket factory property.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ :
+ ://:/]]>
+
+
+
+
+
+
+
+ :
+ ://:/]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ From documentation for {@link #getInputStream(Socket, long)}:
+ Returns InputStream for the socket. If the socket has an associated
+ SocketChannel then it returns a
+ {@link SocketInputStream} with the given timeout. If the socket does not
+ have a channel, {@link Socket#getInputStream()} is returned. In the later
+ case, the timeout argument is ignored and the timeout set with
+ {@link Socket#setSoTimeout(int)} applies for reads.
+
+ Any socket created using socket factories returned by {@link #NetUtils},
+ must use this interface instead of {@link Socket#getInputStream()}.
+
+ @see #getInputStream(Socket, long)
+
+ @param socket
+ @return InputStream for reading from the socket.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+ Any socket created using socket factories returned by {@link #NetUtils},
+ must use this interface instead of {@link Socket#getInputStream()}.
+
+ @see Socket#getChannel()
+
+ @param socket
+ @param timeout timeout in milliseconds. This may not always apply. zero
+ for waiting as long as necessary.
+ @return InputStream for reading from the socket.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+ From documentation for {@link #getOutputStream(Socket, long)} :
+ Returns OutputStream for the socket. If the socket has an associated
+ SocketChannel then it returns a
+ {@link SocketOutputStream} with the given timeout. If the socket does not
+ have a channel, {@link Socket#getOutputStream()} is returned. In the later
+ case, the timeout argument is ignored and the write will wait until
+ data is available.
The task requires the file or the nested fileset element to be
+ specified. Optional attributes are language (set the output
+ language, default is "java"),
+ destdir (name of the destination directory for generated java/c++
+ code, default is ".") and failonerror (specifies error handling
+ behavior. default is true).
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ GenericOptionsParser to parse only the generic Hadoop
+ arguments.
+
+ The array of string arguments other than the generic arguments can be
+ obtained by {@link #getRemainingArgs()}.
+
+ @param conf the Configuration to modify.
+ @param args command-line arguments.]]>
+
+
+
+
+ GenericOptionsParser to parse given options as well
+ as generic Hadoop options.
+
+ The resulting CommandLine object can be obtained by
+ {@link #getCommandLine()}.
+
+ @param conf the configuration to modify
+ @param options options built by the caller
+ @param args User-specified arguments]]>
+
+
+
+
+ Strings containing the un-parsed arguments
+ or empty array if commandLine was not defined.]]>
+
+
+
+
+ CommandLine object
+ to process the parsed arguments.
+
+ Note: If the object is created with
+ {@link #GenericOptionsParser(Configuration, String[])}, then returned
+ object will only contain parsed generic options.
+
+ @return CommandLine representing list of arguments
+ parsed against Options descriptor.]]>
+
+
+
+
+
+
+
+
+
+ GenericOptionsParser is a utility to parse command line
+ arguments generic to the Hadoop framework.
+
+ GenericOptionsParser recognizes several standarad command
+ line arguments, enabling applications to easily specify a namenode, a
+ jobtracker, additional configuration resources etc.
+
+
Generic Options
+
+
The supported generic options are:
+
+ -conf <configuration file> specify a configuration file
+ -D <property=value> use value for given property
+ -fs <local|namenode:port> specify a namenode
+ -jt <local|jobtracker:port> specify a job tracker
+ -files <comma separated list of files> specify comma separated
+ files to be copied to the map reduce cluster
+ -libjars <comma separated list of jars> specify comma separated
+ jar files to include in the classpath.
+ -archives <comma separated list of archives> specify comma
+ separated archives to be unarchived on the compute machines.
+
+
Generic command line arguments might modify
+ Configuration objects, given to constructors.
+
+
The functionality is implemented using Commons CLI.
+
+
Examples:
+
+ $ bin/hadoop dfs -fs darwin:8020 -ls /data
+ list /data directory in dfs with namenode darwin:8020
+
+ $ bin/hadoop dfs -D fs.default.name=darwin:8020 -ls /data
+ list /data directory in dfs with namenode darwin:8020
+
+ $ bin/hadoop dfs -conf hadoop-site.xml -ls /data
+ list /data directory in dfs with conf specified in hadoop-site.xml
+
+ $ bin/hadoop job -D mapred.job.tracker=darwin:50020 -submit job.xml
+ submit a job to job tracker darwin:50020
+
+ $ bin/hadoop job -jt darwin:50020 -submit job.xml
+ submit a job to job tracker darwin:50020
+
+ $ bin/hadoop job -jt local -submit job.xml
+ submit a job to local runner
+
+ $ bin/hadoop jar -libjars testlib.jar
+ -archives test.tgz -files file.txt inputjar args
+ job submission with libjars, files and archives
+
+
+ @see Tool
+ @see ToolRunner]]>
+
+
+
+
+
+
+
+
+
+
+ Class<T>) of the
+ argument of type T.
+ @param The type of the argument
+ @param t the object to get it class
+ @return Class<T>]]>
+
+
+
+
+
+
+ List<T> to a an array of
+ T[].
+ @param c the Class object of the items in the list
+ @param list the list to convert]]>
+
+
+
+
+
+ List<T> to a an array of
+ T[].
+ @param list the list to convert
+ @throws ArrayIndexOutOfBoundsException if the list is empty.
+ Use {@link #toArray(Class, List)} if the list may be empty.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if native-hadoop is loaded,
+ else false]]>
+
+
+
+
+
+ true if native hadoop libraries, if present, can be
+ used for this job; false otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ { pq.top().change(); pq.adjustTop(); }
+ instead of
+ { o = pq.pop(); o.change(); pq.push(o); }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Clients and/or applications can use the provided Progressable
+ to explicitly report progress to the Hadoop framework. This is especially
+ important for operations which take an insignificant amount of time since,
+ in-lieu of the reported progress, the framework has to assume that an error
+ has occured and time-out the operation.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Class is to be obtained
+ @return the correctly typed Class of the given object.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Hadoop Pipes
+ or Hadoop Streaming.
+
+ It also checks to ensure that we are running on a *nix platform else
+ (e.g. in Cygwin/Windows) it returns null.
+ @param job job configuration
+ @return a String[] with the ulimit command arguments or
+ null if we are running on a non *nix platform or
+ if the limit is unspecified.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Shell interface.
+ @param cmd shell command to execute.
+ @return the output of the executed command.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Shell can be used to run unix commands like du or
+ df. It also offers facilities to gate commands by
+ time-intervals.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ShellCommandExecutorshould be used in cases where the output
+ of the command needs no explicit parsing and where the command, working
+ directory and the environment remains unchanged. The output of the command
+ is stored as-is and is expected to be small.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ArrayList of string values]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ charToEscape in the string
+ with the escape char escapeChar
+
+ @param str string
+ @param escapeChar escape char
+ @param charToEscape the char to be escaped
+ @return an escaped string]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ charToEscape in the string
+ with the escape char escapeChar
+
+ @param str string
+ @param escapeChar escape char
+ @param charToEscape the escaped char
+ @return an unescaped string]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Tool, is the standard for any Map-Reduce tool/application.
+ The tool/application should delegate the handling of
+
+ standard command-line options to {@link ToolRunner#run(Tool, String[])}
+ and only handle its custom arguments.
+
+
Here is how a typical Tool is implemented:
+
+ public class MyApp extends Configured implements Tool {
+
+ public int run(String[] args) throws Exception {
+ // Configuration processed by ToolRunner
+ Configuration conf = getConf();
+
+ // Create a JobConf using the processed conf
+ JobConf job = new JobConf(conf, MyApp.class);
+
+ // Process custom command-line options
+ Path in = new Path(args[1]);
+ Path out = new Path(args[2]);
+
+ // Specify various job-specific parameters
+ job.setJobName("my-app");
+ job.setInputPath(in);
+ job.setOutputPath(out);
+ job.setMapperClass(MyApp.MyMapper.class);
+ job.setReducerClass(MyApp.MyReducer.class);
+
+ // Submit the job, then poll for progress until the job is complete
+ JobClient.runJob(job);
+ }
+
+ public static void main(String[] args) throws Exception {
+ // Let ToolRunner handle generic command-line options
+ int res = ToolRunner.run(new Configuration(), new Sort(), args);
+
+ System.exit(res);
+ }
+ }
+
+
+ @see GenericOptionsParser
+ @see ToolRunner]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Tool by {@link Tool#run(String[])}, after
+ parsing with the given generic arguments. Uses the given
+ Configuration, or builds one if null.
+
+ Sets the Tool's configuration with the possibly modified
+ version of the conf.
+
+ @param conf Configuration for the Tool.
+ @param tool Tool to run.
+ @param args command-line arguments to the tool.
+ @return exit code of the {@link Tool#run(String[])} method.]]>
+
+
+
+
+
+
+
+ Tool with its Configuration.
+
+ Equivalent to run(tool.getConf(), tool, args).
+
+ @param tool Tool to run.
+ @param args command-line arguments to the tool.
+ @return exit code of the {@link Tool#run(String[])} method.]]>
+
+
+
+
+
+
+
+
+
+ ToolRunner can be used to run classes implementing
+ Tool interface. It works in conjunction with
+ {@link GenericOptionsParser} to parse the
+
+ generic hadoop command line arguments and modifies the
+ Configuration of the Tool. The
+ application-specific options are passed along without being modified.
+
+
+ @see Tool
+ @see GenericOptionsParser]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/x86_64/share/hadoop/common/jdiff/hadoop_0.18.3.xml b/x86_64/share/hadoop/common/jdiff/hadoop_0.18.3.xml
new file mode 100644
index 0000000..564916f
--- /dev/null
+++ b/x86_64/share/hadoop/common/jdiff/hadoop_0.18.3.xml
@@ -0,0 +1,38826 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ final.
+
+ @param name resource to be added, the classpath is examined for a file
+ with that name.]]>
+
+
+
+
+
+ final.
+
+ @param url url of the resource to be added, the local filesystem is
+ examined directly to find the resource, without referring to
+ the classpath.]]>
+
+
+
+
+
+ final.
+
+ @param file file-path of resource to be added, the local filesystem is
+ examined directly to find the resource, without referring to
+ the classpath.]]>
+
+
+
+
+
+ name property, null if
+ no such property exists.
+
+ Values are processed for variable expansion
+ before being returned.
+
+ @param name the property name.
+ @return the value of the name property,
+ or null if no such property exists.]]>
+
+
+
+
+
+ name property, without doing
+ variable expansion.
+
+ @param name the property name.
+ @return the value of the name property,
+ or null if no such property exists.]]>
+
+
+
+
+
+
+ value of the name property.
+
+ @param name property name.
+ @param value property value.]]>
+
+
+
+
+
+
+ name property. If no such property
+ exists, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value, or defaultValue if the property
+ doesn't exist.]]>
+
+
+
+
+
+
+ name property as an int.
+
+ If no such property exists, or if the specified value is not a valid
+ int, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as an int,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to an int.
+
+ @param name property name.
+ @param value int value of the property.]]>
+
+
+
+
+
+
+ name property as a long.
+ If no such property is specified, or if the specified value is not a valid
+ long, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as a long,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to a long.
+
+ @param name property name.
+ @param value long value of the property.]]>
+
+
+
+
+
+
+ name property as a float.
+ If no such property is specified, or if the specified value is not a valid
+ float, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as a float,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property as a boolean.
+ If no such property is specified, or if the specified value is not a valid
+ boolean, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as a boolean,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to a boolean.
+
+ @param name property name.
+ @param value boolean value of the property.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ name property as
+ a collection of Strings.
+ If no such property is specified then empty collection is returned.
+
+ This is an optimized version of {@link #getStrings(String)}
+
+ @param name property name.
+ @return property value as a collection of Strings.]]>
+
+
+
+
+
+ name property as
+ an array of Strings.
+ If no such property is specified then null is returned.
+
+ @param name property name.
+ @return property value as an array of Strings,
+ or null.]]>
+
+
+
+
+
+
+ name property as
+ an array of Strings.
+ If no such property is specified then default value is returned.
+
+ @param name property name.
+ @param defaultValue The default value
+ @return property value as an array of Strings,
+ or default value.]]>
+
+
+
+
+
+
+ name property as
+ as comma delimited values.
+
+ @param name property name.
+ @param values The values]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ name property as a Class.
+ If no such property is specified, then defaultValue is
+ returned.
+
+ @param name the class name.
+ @param defaultValue default value.
+ @return property value as a Class,
+ or defaultValue.]]>
+
+
+
+
+
+
+
+ name property as a Class
+ implementing the interface specified by xface.
+
+ If no such property is specified, then defaultValue is
+ returned.
+
+ An exception is thrown if the returned class does not implement the named
+ interface.
+
+ @param name the class name.
+ @param defaultValue default value.
+ @param xface the interface implemented by the named class.
+ @return property value as a Class,
+ or defaultValue.]]>
+
+
+
+
+
+
+
+ name property to the name of a
+ theClass implementing the given interface xface.
+
+ An exception is thrown if theClass does not implement the
+ interface xface.
+
+ @param name property name.
+ @param theClass property value.
+ @param xface the interface implemented by the named class.]]>
+
+
+
+
+
+
+
+ dirsProp with
+ the given path. If dirsProp contains multiple directories,
+ then one is chosen based on path's hash code. If the selected
+ directory does not exist, an attempt is made to create it.
+
+ @param dirsProp directory in which to locate the file.
+ @param path file-path.
+ @return local file under the directory with the given path.]]>
+
+
+
+
+
+
+
+ dirsProp with
+ the given path. If dirsProp contains multiple directories,
+ then one is chosen based on path's hash code. If the selected
+ directory does not exist, an attempt is made to create it.
+
+ @param dirsProp directory in which to locate the file.
+ @param path file-path.
+ @return local file under the directory with the given path.]]>
+
+
+
+
+
+
+
+
+
+
+
+ name.
+
+ @param name configuration resource name.
+ @return an input stream attached to the resource.]]>
+
+
+
+
+
+ name.
+
+ @param name configuration resource name.
+ @return a reader attached to the resource.]]>
+
+
+
+
+ String
+ key-value pairs in the configuration.
+
+ @return an iterator over the entries.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true to set quiet-mode on, false
+ to turn it off.]]>
+
+
+
+
+
+
+
+
+
+
+ Resources
+
+
Configurations are specified by resources. A resource contains a set of
+ name/value pairs as XML data. Each resource is named by either a
+ String or by a {@link Path}. If named by a String,
+ then the classpath is examined for a file with that name. If named by a
+ Path, then the local filesystem is examined directly, without
+ referring to the classpath.
+
+
Hadoop by default specifies two resources, loaded in-order from the
+ classpath:
hadoop-site.xml: Site-specific configuration for a given hadoop
+ installation.
+
+ Applications may add additional resources, which are loaded
+ subsequent to these resources in the order they are added.
+
+
Final Parameters
+
+
Configuration parameters may be declared final.
+ Once a resource declares a value final, no subsequently-loaded
+ resource can alter that value.
+ For example, one might define a final parameter with:
+
+
+ When conf.get("tempdir") is called, then ${basedir}
+ will be resolved to another property in this Configuration, while
+ ${user.name} would then ordinarily be resolved to the value
+ of the System property with that name.]]>
+
Applications specify the files, via urls (hdfs:// or http://) to be cached
+ via the {@link JobConf}. The DistributedCache assumes that the
+ files specified via hdfs:// urls are already present on the
+ {@link FileSystem} at the path specified by the url.
+
+
The framework will copy the necessary files on to the slave node before
+ any tasks for the job are executed on that node. Its efficiency stems from
+ the fact that the files are only copied once per job and the ability to
+ cache archives which are un-archived on the slaves.
+
+
DistributedCache can be used to distribute simple, read-only
+ data/text files and/or more complex types such as archives, jars etc.
+ Archives (zip, tar and tgz/tar.gz files) are un-archived at the slave nodes.
+ Jars may be optionally added to the classpath of the tasks, a rudimentary
+ software distribution mechanism. Files have execution permissions.
+ Optionally users can also direct it to symlink the distributed cache file(s)
+ into the working directory of the task.
+
+
DistributedCache tracks modification timestamps of the cache
+ files. Clearly the cache files should not be modified by the application
+ or externally while the job is executing.
+
+
Here is an illustrative example on how to use the
+ DistributedCache:
+
+ // Setting up the cache for the application
+
+ 1. Copy the requisite files to the FileSystem:
+
+ $ bin/hadoop fs -copyFromLocal lookup.dat /myapp/lookup.dat
+ $ bin/hadoop fs -copyFromLocal map.zip /myapp/map.zip
+ $ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar
+ $ bin/hadoop fs -copyFromLocal mytar.tar /myapp/mytar.tar
+ $ bin/hadoop fs -copyFromLocal mytgz.tgz /myapp/mytgz.tgz
+ $ bin/hadoop fs -copyFromLocal mytargz.tar.gz /myapp/mytargz.tar.gz
+
+ 2. Setup the application's JobConf:
+
+ JobConf job = new JobConf();
+ DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"),
+ job);
+ DistributedCache.addCacheArchive(new URI("/myapp/map.zip", job);
+ DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
+ DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar", job);
+ DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz", job);
+ DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz", job);
+
+ 3. Use the cached files in the {@link Mapper} or {@link Reducer}:
+
+ public static class MapClass extends MapReduceBase
+ implements Mapper<K, V, K, V> {
+
+ private Path[] localArchives;
+ private Path[] localFiles;
+
+ public void configure(JobConf job) {
+ // Get the cached archives/files
+ localArchives = DistributedCache.getLocalCacheArchives(job);
+ localFiles = DistributedCache.getLocalCacheFiles(job);
+ }
+
+ public void map(K key, V value,
+ OutputCollector<K, V> output, Reporter reporter)
+ throws IOException {
+ // Use data from the cached archives/files here
+ // ...
+ // ...
+ output.collect(k, v);
+ }
+ }
+
+
+ A filename pattern is composed of regular characters and
+ special pattern matching characters, which are:
+
+
+
+
+
+
?
+
Matches any single character.
+
+
+
*
+
Matches zero or more characters.
+
+
+
[abc]
+
Matches a single character from character set
+ {a,b,c}.
+
+
+
[a-b]
+
Matches a single character from the character range
+ {a...b}. Note that character a must be
+ lexicographically less than or equal to character b.
+
+
+
[^a]
+
Matches a single character that is not from character set or range
+ {a}. Note that the ^ character must occur
+ immediately to the right of the opening bracket.
+
+
+
\c
+
Removes (escapes) any special meaning of character c.
+
+
+
{ab,cd}
+
Matches a string from the string set {ab, cd}
+
+
+
{ab,c{de,fh}}
+
Matches a string from the string set {ab, cde, cfh}
+
+
+
+
+
+ @param pathPattern a regular expression specifying a pth pattern
+
+ @return an array of paths that match the path pattern
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ All user code that may potentially use the Hadoop Distributed
+ File System should be written to use a FileSystem object. The
+ Hadoop DFS is a multi-machine system that appears as a single
+ disk. It's useful because of its fault tolerance and potentially
+ very large capacity.
+
+
+ The local implementation is {@link LocalFileSystem} and distributed
+ implementation is {@link DistributedFileSystem}.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FilterFileSystem contains
+ some other file system, which it uses as
+ its basic file system, possibly transforming
+ the data along the way or providing additional
+ functionality. The class FilterFileSystem
+ itself simply overrides all methods of
+ FileSystem with versions that
+ pass all requests to the contained file
+ system. Subclasses of FilterFileSystem
+ may further override some of these methods
+ and may also provide additional methods
+ and fields.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ buf at offset
+ and checksum into checksum.
+ The method is used for implementing read, therefore, it should be optimized
+ for sequential reading
+ @param pos chunkPos
+ @param buf desitination buffer
+ @param offset offset in buf at which to store data
+ @param len maximun number of bytes to read
+ @return number of bytes read]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ -1 if the end of the
+ stream is reached.
+ @exception IOException if an I/O error occurs.]]>
+
+
+
+
+
+
+
+
+ This method implements the general contract of the corresponding
+ {@link InputStream#read(byte[], int, int) read} method of
+ the {@link InputStream} class. As an additional
+ convenience, it attempts to read as many bytes as possible by repeatedly
+ invoking the read method of the underlying stream. This
+ iterated read continues until one of the following
+ conditions becomes true:
+
+
The specified number of bytes have been read,
+
+
The read method of the underlying stream returns
+ -1, indicating end-of-file.
+
+
If the first read on the underlying stream returns
+ -1 to indicate end-of-file then this method returns
+ -1. Otherwise this method returns the number of bytes
+ actually read.
+
+ @param b destination buffer.
+ @param off offset at which to start storing bytes.
+ @param len maximum number of bytes to read.
+ @return the number of bytes read, or -1 if the end of
+ the stream has been reached.
+ @exception IOException if an I/O error occurs.
+ ChecksumException if any checksum error occurs]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ n bytes of data from the
+ input stream.
+
+
This method may skip more bytes than are remaining in the backing
+ file. This produces no exception and the number of bytes skipped
+ may include some number of bytes that were beyond the EOF of the
+ backing file. Attempting to read from the stream after skipping past
+ the end will result in -1 indicating the end of the file.
+
+
If n is negative, no bytes are skipped.
+
+ @param n the number of bytes to be skipped.
+ @return the actual number of bytes skipped.
+ @exception IOException if an I/O error occurs.
+ ChecksumException if the chunk to skip to is corrupted]]>
+
+
+
+
+
+
+ This method may seek past the end of the file.
+ This produces no exception and an attempt to read from
+ the stream will result in -1 indicating the end of the file.
+
+ @param pos the postion to seek to.
+ @exception IOException if an I/O error occurs.
+ ChecksumException if the chunk to seek to is corrupted]]>
+
+
+
+
+
+
+
+
+
+ len bytes from
+ stm
+
+ @param stm an input stream
+ @param buf destiniation buffer
+ @param offset offset at which to store data
+ @param len number of bytes to read
+ @return actual number of bytes read
+ @throws IOException if there is any IO error]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ len bytes from the specified byte array
+ starting at offset off and generate a checksum for
+ each data chunk.
+
+
This method stores bytes from the given array into this
+ stream's buffer before it gets checksumed. The buffer gets checksumed
+ and flushed to the underlying output stream when all data
+ in a checksum chunk are in the buffer. If the buffer is empty and
+ requested length is at least as large as the size of next checksum chunk
+ size, this method will checksum and write the chunk directly
+ to the underlying output stream. Thus it avoids uneccessary data copy.
+
+ @param b the data.
+ @param off the start offset in the data.
+ @param len the number of bytes to write.
+ @exception IOException if an I/O error occurs.]]>
+
+
+ DataInputBuffer buffer = new DataInputBuffer();
+ while (... loop condition ...) {
+ byte[] data = ... get data ...;
+ int dataLength = ... get data length ...;
+ buffer.reset(data, dataLength);
+ ... read buffer using DataInput methods ...
+ }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This saves memory over creating a new DataOutputStream and
+ ByteArrayOutputStream each time data is written.
+
+
Typical usage is something like the following:
+
+ DataOutputBuffer buffer = new DataOutputBuffer();
+ while (... loop condition ...) {
+ buffer.reset();
+ ... write buffer using DataOutput methods ...
+ byte[] data = buffer.getData();
+ int dataLength = buffer.getLength();
+ ... write data to its ultimate destination ...
+ }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to store
+ @param item the object to be stored
+ @param keyName the name of the key to use
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param keyName the name of the key to use
+ @param itemClass the class of the item
+ @return restored object
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param items the objects to be stored
+ @param keyName the name of the key to use
+ @throws IndexOutOfBoundsException if the items array is empty
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param keyName the name of the key to use
+ @param itemClass the class of the item
+ @return restored object
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+ DefaultStringifier offers convenience methods to store/load objects to/from
+ the configuration.
+
+ @param the class of the objects to stringify]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a DoubleWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a FloatWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ When two sequence files, which have same Key type but different Value
+ types, are mapped out to reduce, multiple Value types is not allowed.
+ In this case, this class can help you wrap instances with different types.
+
+
+
+ Compared with ObjectWritable, this class is much more effective,
+ because ObjectWritable will append the class declaration as a String
+ into the output file in every Key-Value pair.
+
+
+
+ Generic Writable implements {@link Configurable} interface, so that it will be
+ configured by the framework. The configuration is passed to the wrapped objects
+ implementing {@link Configurable} interface before deserialization.
+
+
+ how to use it:
+ 1. Write your own class, such as GenericObject, which extends GenericWritable.
+ 2. Implements the abstract method getTypes(), defines
+ the classes which will be wrapped in GenericObject in application.
+ Attention: this classes defined in getTypes() method, must
+ implement Writable interface.
+
+
+ @since Nov 8, 2006]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This saves memory over creating a new InputStream and
+ ByteArrayInputStream each time data is read.
+
+
Typical usage is something like the following:
+
+ InputBuffer buffer = new InputBuffer();
+ while (... loop condition ...) {
+ byte[] data = ... get data ...;
+ int dataLength = ... get data length ...;
+ buffer.reset(data, dataLength);
+ ... read buffer using InputStream methods ...
+ }
+
+ @see DataInputBuffer
+ @see DataOutput]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a IntWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ closes the input and output streams
+ at the end.
+ @param in InputStrem to read from
+ @param out OutputStream to write to
+ @param conf the Configuration object]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ignore any {@link IOException} or
+ null pointers. Must only be used for cleanup in exception handlers.
+ @param log the log to record problems to at debug level. Can be null.
+ @param closeables the objects to close]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a LongWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A map is a directory containing two files, the data file,
+ containing all keys and values in the map, and a smaller index
+ file, containing a fraction of the keys. The fraction is determined by
+ {@link Writer#getIndexInterval()}.
+
+
The index file is read entirely into memory. Thus key implementations
+ should try to keep themselves small.
+
+
Map files are created by adding entries in-order. To maintain a large
+ database, perform updates by copying the previous version of a database and
+ merging in a sorted change list, to create a new version of the database in
+ a new file. Sorting large change lists can be done with {@link
+ SequenceFile.Sorter}.]]>
+
SequenceFile provides {@link Writer}, {@link Reader} and
+ {@link Sorter} classes for writing, reading and sorting respectively.
+
+ There are three SequenceFileWriters based on the
+ {@link CompressionType} used to compress key/value pairs:
+
+
+ Writer : Uncompressed records.
+
+
+ RecordCompressWriter : Record-compressed files, only compress
+ values.
+
+
+ BlockCompressWriter : Block-compressed files, both keys &
+ values are collected in 'blocks'
+ separately and compressed. The size of
+ the 'block' is configurable.
+
+
+
The actual compression algorithm used to compress key and/or values can be
+ specified by using the appropriate {@link CompressionCodec}.
+
+
The recommended way is to use the static createWriter methods
+ provided by the SequenceFile to chose the preferred format.
+
+
The {@link Reader} acts as the bridge and can read any of the above
+ SequenceFile formats.
+
+
SequenceFile Formats
+
+
Essentially there are 3 different formats for SequenceFiles
+ depending on the CompressionType specified. All of them share a
+ common header described below.
+
+
SequenceFile Header
+
+
+ version - 3 bytes of magic header SEQ, followed by 1 byte of actual
+ version number (e.g. SEQ4 or SEQ6)
+
+
+ keyClassName -key class
+
+
+ valueClassName - value class
+
+
+ compression - A boolean which specifies if compression is turned on for
+ keys/values in this file.
+
+
+ blockCompression - A boolean which specifies if block-compression is
+ turned on for keys/values in this file.
+
+
+ compression codec - CompressionCodec class which is used for
+ compression of keys and/or values (if compression is
+ enabled).
+
+
+ metadata - {@link Metadata} for this file.
+
+
+ sync - A sync marker to denote end of the header.
+
The compressed blocks of key lengths and value lengths consist of the
+ actual lengths of individual keys/values encoded in ZeroCompressedInteger
+ format.
+
+ @see CompressionCodec]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ key, skipping its
+ value. True if another entry exists, and false at end of file.]]>
+
+
+
+
+
+
+
+ key and
+ val. Returns true if such a pair exists and false when at
+ end of file]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The position passed must be a position returned by {@link
+ SequenceFile.Writer#getLength()} when writing this file. To seek to an arbitrary
+ position, use {@link SequenceFile.Reader#sync(long)}.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ SegmentDescriptor
+ @param segments the list of SegmentDescriptors
+ @param tmpDir the directory to write temporary files into
+ @return RawKeyValueIterator
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ For best performance, applications should make sure that the {@link
+ Writable#readFields(DataInput)} implementation of their keys is
+ very efficient. In particular, it should avoid allocating memory.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This always returns a synchronized position. In other words,
+ immediately after calling {@link SequenceFile.Reader#seek(long)} with a position
+ returned by this method, {@link SequenceFile.Reader#next(Writable)} may be called. However
+ the key may be earlier in the file than key last written when this
+ method was called (e.g., with block-compression, it may be the first key
+ in the block that was being written when this method was called).]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ key. Returns
+ true if such a key exists and false when at the end of the set.]]>
+
+
+
+
+
+
+ key.
+ Returns key, or null if no match exists.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the class of the objects to stringify]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ position. Note that this
+ method avoids using the converter or doing String instatiation
+ @return the Unicode scalar value at position or -1
+ if the position is invalid or points to a
+ trailing byte]]>
+
+
+
+
+
+
+
+
+
+ what in the backing
+ buffer, starting as position start. The starting
+ position is measured in bytes and the return value is in
+ terms of byte position in the buffer. The backing buffer is
+ not converted to a string for this operation.
+ @return byte position of the first occurence of the search
+ string in the UTF-8 buffer or -1 if not found]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a Text with the same contents.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ replace is true, then
+ malformed input is replaced with the
+ substitution character, which is U+FFFD. Otherwise the
+ method throws a MalformedInputException.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ replace is true, then
+ malformed input is replaced with the
+ substitution character, which is U+FFFD. Otherwise the
+ method throws a MalformedInputException.
+ @return ByteBuffer: bytes stores at ByteBuffer.array()
+ and length is ByteBuffer.limit()]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ In
+ addition, it provides methods for string traversal without converting the
+ byte array to a string.
Also includes utilities for
+ serializing/deserialing a string, coding/decoding a string, checking if a
+ byte array contains valid UTF8 code, calculating the length of an encoded
+ string.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a UTF8 with the same contents.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Also includes utilities for efficiently reading and writing UTF-8.
+
+ @deprecated replaced by Text]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This is useful when a class may evolve, so that instances written by the
+ old version of the class may still be processed by the new version. To
+ handle this situation, {@link #readFields(DataInput)}
+ implementations should catch {@link VersionMismatchException}.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a VIntWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a VLongWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ out.
+
+ @param out DataOuput to serialize this object into.
+ @throws IOException]]>
+
+
+
+
+
+
+ in.
+
+
For efficiency, implementations should attempt to re-use storage in the
+ existing object where possible.
+
+ @param in DataInput to deseriablize this object from.
+ @throws IOException]]>
+
+
+
+ Any key or value type in the Hadoop Map-Reduce
+ framework implements this interface.
+
+
Implementations typically implement a static read(DataInput)
+ method which constructs a new instance, calls {@link #readFields(DataInput)}
+ and returns the instance.
+
+
Example:
+
+ public class MyWritable implements Writable {
+ // Some data
+ private int counter;
+ private long timestamp;
+
+ public void write(DataOutput out) throws IOException {
+ out.writeInt(counter);
+ out.writeLong(timestamp);
+ }
+
+ public void readFields(DataInput in) throws IOException {
+ counter = in.readInt();
+ timestamp = in.readLong();
+ }
+
+ public static MyWritable read(DataInput in) throws IOException {
+ MyWritable w = new MyWritable();
+ w.readFields(in);
+ return w;
+ }
+ }
+
]]>
+
+
+
+
+
+
+
+
+ WritableComparables can be compared to each other, typically
+ via Comparators. Any type which is to be used as a
+ key in the Hadoop Map-Reduce framework should implement this
+ interface.
+
+
Example:
+
+ public class MyWritableComparable implements WritableComparable {
+ // Some data
+ private int counter;
+ private long timestamp;
+
+ public void write(DataOutput out) throws IOException {
+ out.writeInt(counter);
+ out.writeLong(timestamp);
+ }
+
+ public void readFields(DataInput in) throws IOException {
+ counter = in.readInt();
+ timestamp = in.readLong();
+ }
+
+ public int compareTo(MyWritableComparable w) {
+ int thisValue = this.value;
+ int thatValue = ((IntWritable)o).value;
+ return (thisValue < thatValue ? -1 : (thisValue==thatValue ? 0 : 1));
+ }
+ }
+
One may optimize compare-intensive operations by overriding
+ {@link #compare(byte[],int,int,byte[],int,int)}. Static utility methods are
+ provided to assist in optimized implementations of this method.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Enum type
+ @param in DataInput to read from
+ @param enumType Class type of Enum
+ @return Enum represented by String read from DataInput
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ len number of bytes in input streamin
+ @param in input stream
+ @param len number of bytes to skip
+ @throws IOException when skipped less number of bytes]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ CompressionCodec for which to get the
+ Compressor
+ @return Compressor for the given
+ CompressionCodec from the pool or a new one]]>
+
+
+
+
+
+ CompressionCodec for which to get the
+ Decompressor
+ @return Decompressor for the given
+ CompressionCodec the pool or a new one]]>
+
+
+
+
+
+ Compressor to be returned to the pool]]>
+
+
+
+
+
+ Decompressor to be returned to the
+ pool]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Implementations are assumed to be buffered. This permits clients to
+ reposition the underlying input stream then call {@link #resetState()},
+ without having to also synchronize client buffers.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true indicating that more input data is required.
+
+ @param b Input data
+ @param off Start offset
+ @param len Length]]>
+
+
+
+
+ true if the input data buffer is empty and
+ #setInput() should be called in order to provide more input.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the end of the compressed
+ data output stream has been reached.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true indicating that more input data is required.
+
+ @param b Input data
+ @param off Start offset
+ @param len Length]]>
+
+
+
+
+ true if the input data buffer is empty and
+ #setInput() should be called in order to provide more input.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ true if a preset dictionary is needed for decompression.
+ @return true if a preset dictionary is needed for decompression]]>
+
+
+
+
+ true if the end of the compressed
+ data output stream has been reached.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if native-lzo library is loaded & initialized;
+ else false]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ lzo compression/decompression pair.
+ http://www.oberhumer.com/opensource/lzo/]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if lzo compressors are loaded & initialized,
+ else false]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if lzo decompressors are loaded & initialized,
+ else false]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ @return the total (non-negative) number of uncompressed bytes input so far]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ @return the total (non-negative) number of uncompressed bytes input so far]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if native-zlib is loaded & initialized
+ and can be loaded for this job, else false]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Keep trying a limited number of times, waiting a fixed time between attempts,
+ and then fail by re-throwing the exception.
+ ]]>
+
+
+
+
+
+
+
+
+ Keep trying for a maximum time, waiting a fixed time between attempts,
+ and then fail by re-throwing the exception.
+ ]]>
+
+
+
+
+
+
+
+
+ Keep trying a limited number of times, waiting a growing amount of time between attempts,
+ and then fail by re-throwing the exception.
+ The time between attempts is sleepTime mutliplied by the number of tries so far.
+ ]]>
+
+
+
+
+
+
+
+
+ Keep trying a limited number of times, waiting a growing amount of time between attempts,
+ and then fail by re-throwing the exception.
+ The time between attempts is sleepTime mutliplied by a random
+ number in the range of [0, 2 to the number of retries)
+ ]]>
+
+
+
+
+
+
+
+ Set a default policy with some explicit handlers for specific exceptions.
+ ]]>
+
+
+
+
+
+
+
+ A retry policy for RemoteException
+ Set a default policy with some explicit handlers for specific exceptions.
+ ]]>
+
+
+
+
+
+ Try once, and fail by re-throwing the exception.
+ This corresponds to having no retry mechanism in place.
+ ]]>
+
+
+
+
+
+ Try once, and fail silently for void methods, or by
+ re-throwing the exception for non-void methods.
+ ]]>
+
+
+
+
+
+ Keep trying forever.
+ ]]>
+
+
+
+
+ A collection of useful implementations of {@link RetryPolicy}.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+ Determines whether the framework should retry a
+ method for the given exception, and the number
+ of retries that have been made for that operation
+ so far.
+
+ @param e The exception that caused the method to fail.
+ @param retries The number of times the method has been retried.
+ @return true if the method should be retried,
+ false if the method should not be retried
+ but shouldn't fail with an exception (only for void methods).
+ @throws Exception The re-thrown exception e indicating
+ that the method failed and should not be retried further.]]>
+
+
+
+
+ Specifies a policy for retrying method failures.
+ Implementations of this interface should be immutable.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Create a proxy for an interface of an implementation class
+ using the same retry policy for each method in the interface.
+
+ @param iface the interface that the retry will implement
+ @param implementation the instance whose methods should be retried
+ @param retryPolicy the policy for retirying method call failures
+ @return the retry proxy]]>
+
+
+
+
+
+
+
+
+ Create a proxy for an interface of an implementation class
+ using the a set of retry policies specified by method name.
+ If no retry policy is defined for a method then a default of
+ {@link RetryPolicies#TRY_ONCE_THEN_FAIL} is used.
+
+ @param iface the interface that the retry will implement
+ @param implementation the instance whose methods should be retried
+ @param methodNameToPolicyMap a map of method names to retry policies
+ @return the retry proxy]]>
+
+
+
+
+ A factory for creating retry proxies.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+ Prepare the deserializer for reading.]]>
+
+
+
+
+
+
+
+ Deserialize the next object from the underlying input stream.
+ If the object t is non-null then this deserializer
+ may set its internal state to the next object read from the input
+ stream. Otherwise, if the object t is null a new
+ deserialized object will be created.
+
+ @return the deserialized object]]>
+
+
+
+
+
+ Close the underlying input stream and clear up any resources.]]>
+
+
+
+
+ Provides a facility for deserializing objects of type from an
+ {@link InputStream}.
+
+
+
+ Deserializers are stateful, but must not buffer the input since
+ other producers may read from the input between calls to
+ {@link #deserialize(Object)}.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A {@link RawComparator} that uses a {@link Deserializer} to deserialize
+ the objects to be compared so that the standard {@link Comparator} can
+ be used to compare them.
+
+
+ One may optimize compare-intensive operations by using a custom
+ implementation of {@link RawComparator} that operates directly
+ on byte representations.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ An experimental {@link Serialization} for Java {@link Serializable} classes.
+
+ @see JavaSerializationComparator]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A {@link RawComparator} that uses a {@link JavaSerialization}
+ {@link Deserializer} to deserialize objects that are then compared via
+ their {@link Comparable} interfaces.
+
+ @param
+ @see JavaSerialization]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Encapsulates a {@link Serializer}/{@link Deserializer} pair.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+ Serializations are found by reading the io.serializations
+ property from conf, which is a comma-delimited list of
+ classnames.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A factory for {@link Serialization}s.
+ ]]>
+
+
+
+
+
+
+
+
+
+ Prepare the serializer for writing.]]>
+
+
+
+
+
+
+ Serialize t to the underlying output stream.]]>
+
+
+
+
+
+ Close the underlying output stream and clear up any resources.]]>
+
+
+
+
+ Provides a facility for serializing objects of type to an
+ {@link OutputStream}.
+
+
+
+ Serializers are stateful, but must not buffer the output since
+ other producers may write to the output between calls to
+ {@link #serialize(Object)}.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ param, to the IPC server running at
+ address, returning the value. Throws exceptions if there are
+ network problems or if the remote code threw an exception.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Unwraps any IOException.
+
+ @param lookupTypes the desired exception class.
+ @return IOException, which is either the lookupClass exception or this.]]>
+
+
+
+
+ This unwraps any Throwable that has a constructor taking
+ a String as a parameter.
+ Otherwise it returns this.
+
+ @return Throwable]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ protocol is a Java interface. All parameters and return types must
+ be one of:
+
+
a primitive type, boolean, byte,
+ char, short, int, long,
+ float, double, or void; or
+
+
a {@link String}; or
+
+
a {@link Writable}; or
+
+
an array of the above types
+
+ All methods in the protocol should throw only IOException. No field data of
+ the protocol instance is transmitted.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ handlerCount determines
+ the number of handler threads that will be used to process calls.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class has a number of metrics variables that are publicly accessible;
+ these variables (objects) have methods to update their values;
+ for example:
+
{@link #rpcQueueTime}.inc(time)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ For the statistics that are sampled and averaged, one must specify
+ a metrics context that does periodic update calls. Most do.
+ The default Null metrics context however does NOT. So if you aren't
+ using any other metrics context then you can turn on the viewing and averaging
+ of sampled metrics by specifying the following two lines
+ in the hadoop-meterics.properties file:
+
+ Note that the metrics are collected regardless of the context used.
+ The context with the update thread is used to average the data periodically]]>
+
Grouphandles localization of the class name and the
+ counter names.
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FileInputFormat implementations can override this and return
+ false to ensure that individual input files are never split-up
+ so that {@link Mapper}s process entire files.
+
+ @param fs the file system that the file is on
+ @param filename the file name to check
+ @return is this file splitable?]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FileInputFormat is the base class for all file-based
+ InputFormats. This provides a generic implementation of
+ {@link #getSplits(JobConf, int)}.
+ Subclasses of FileInputFormat can also override the
+ {@link #isSplitable(FileSystem, Path)} method to ensure input-files are
+ not split-up and are processed as a whole by {@link Mapper}s.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the job output should be compressed,
+ false otherwise]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Tasks' Side-Effect Files
+
+
Some applications need to create/write-to side-files, which differ from
+ the actual job-outputs.
+
+
In such cases there could be issues with 2 instances of the same TIP
+ (running simultaneously e.g. speculative tasks) trying to open/write-to the
+ same file (path) on HDFS. Hence the application-writer will have to pick
+ unique names per task-attempt (e.g. using the attemptid, say
+ attempt_200709221812_0001_m_000000_0), not just per TIP.
+
+
To get around this the Map-Reduce framework helps the application-writer
+ out by maintaining a special
+ ${mapred.output.dir}/_temporary/_${taskid}
+ sub-directory for each task-attempt on HDFS where the output of the
+ task-attempt goes. On successful completion of the task-attempt the files
+ in the ${mapred.output.dir}/_temporary/_${taskid} (only)
+ are promoted to ${mapred.output.dir}. Of course, the
+ framework discards the sub-directory of unsuccessful task-attempts. This
+ is completely transparent to the application.
+
+
The application-writer can take advantage of this by creating any
+ side-files required in ${mapred.work.output.dir} during execution
+ of his reduce-task i.e. via {@link #getWorkOutputPath(JobConf)}, and the
+ framework will move them out similarly - thus she doesn't have to pick
+ unique paths per task-attempt.
+
+
Note: the value of ${mapred.work.output.dir} during
+ execution of a particular task-attempt is actually
+ ${mapred.output.dir}/_temporary/_{$taskid}, and this value is
+ set by the map-reduce framework. So, just create any side-files in the
+ path returned by {@link #getWorkOutputPath(JobConf)} from map/reduce
+ task to take advantage of this feature.
+
+
The entire discussion holds true for maps of jobs with
+ reducer=NONE (i.e. 0 reduces) since output of the map, in that case,
+ goes directly to HDFS.
+
+ @return the {@link Path} to the task's temporary output directory
+ for the map-reduce job.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This method is used to validate the input directories when a job is
+ submitted so that the {@link JobClient} can fail early, with an useful
+ error message, in case of errors. For e.g. input directory does not exist.
+
+
+ @param job job configuration.
+ @throws InvalidInputException if the job does not have valid input
+ @deprecated getSplits is called in the client and can perform any
+ necessary validation of the input]]>
+
+
+
+
+
+
+
+ Each {@link InputSplit} is then assigned to an individual {@link Mapper}
+ for processing.
+
+
Note: The split is a logical split of the inputs and the
+ input files are not physically split into chunks. For e.g. a split could
+ be <input-file-path, start, offset> tuple.
+
+ @param job job configuration.
+ @param numSplits the desired number of splits, a hint.
+ @return an array of {@link InputSplit}s for the job.]]>
+
+
+
+
+
+
+
+
+ It is the responsibility of the RecordReader to respect
+ record boundaries while processing the logical split to present a
+ record-oriented view to the individual task.
+
+ @param split the {@link InputSplit}
+ @param job the job that this split belongs to
+ @return a {@link RecordReader}]]>
+
+
+
+ InputFormat describes the input-specification for a
+ Map-Reduce job.
+
+
The Map-Reduce framework relies on the InputFormat of the
+ job to:
+
+
+ Validate the input-specification of the job.
+
+ Split-up the input file(s) into logical {@link InputSplit}s, each of
+ which is then assigned to an individual {@link Mapper}.
+
+
+ Provide the {@link RecordReader} implementation to be used to glean
+ input records from the logical InputSplit for processing by
+ the {@link Mapper}.
+
+
+
+
The default behavior of file-based {@link InputFormat}s, typically
+ sub-classes of {@link FileInputFormat}, is to split the
+ input into logical {@link InputSplit}s based on the total size, in
+ bytes, of the input files. However, the {@link FileSystem} blocksize of
+ the input files is treated as an upper bound for input splits. A lower bound
+ on the split size can be set via
+
+ mapred.min.split.size.
+
+
Clearly, logical splits based on input-size is insufficient for many
+ applications since record boundaries are to respected. In such cases, the
+ application has to also implement a {@link RecordReader} on whom lies the
+ responsibilty to respect record-boundaries and present a record-oriented
+ view of the logical InputSplit to the individual task.
+
+ @see InputSplit
+ @see RecordReader
+ @see JobClient
+ @see FileInputFormat]]>
+
+
+
+
+
+
+
+
+
+ InputSplit.
+
+ @return the number of bytes in the input split.
+ @throws IOException]]>
+
+
+
+
+
+ InputSplit is
+ located as an array of Strings.
+ @throws IOException]]>
+
+
+
+ InputSplit represents the data to be processed by an
+ individual {@link Mapper}.
+
+
Typically, it presents a byte-oriented view on the input and is the
+ responsibility of {@link RecordReader} of the job to process this and present
+ a record-oriented view.
+
+ @see InputFormat
+ @see RecordReader]]>
+
+ Checking the input and output specifications of the job.
+
+
+ Computing the {@link InputSplit}s for the job.
+
+
+ Setup the requisite accounting information for the {@link DistributedCache}
+ of the job, if necessary.
+
+
+ Copying the job's jar and configuration to the map-reduce system directory
+ on the distributed file-system.
+
+
+ Submitting the job to the JobTracker and optionally monitoring
+ it's status.
+
+
+
+ Normally the user creates the application, describes various facets of the
+ job via {@link JobConf} and then uses the JobClient to submit
+ the job and monitor its progress.
+
+
Here is an example on how to use JobClient:
+
+ // Create a new JobConf
+ JobConf job = new JobConf(new Configuration(), MyJob.class);
+
+ // Specify various job-specific parameters
+ job.setJobName("myjob");
+
+ job.setInputPath(new Path("in"));
+ job.setOutputPath(new Path("out"));
+
+ job.setMapperClass(MyJob.MyMapper.class);
+ job.setReducerClass(MyJob.MyReducer.class);
+
+ // Submit the job, then poll for progress until the job is complete
+ JobClient.runJob(job);
+
+
+
Job Control
+
+
At times clients would chain map-reduce jobs to accomplish complex tasks
+ which cannot be done via a single map-reduce job. This is fairly easy since
+ the output of the job, typically, goes to distributed file-system and that
+ can be used as the input for the next job.
+
+
However, this also means that the onus on ensuring jobs are complete
+ (success/failure) lies squarely on the clients. In such situations the
+ various job-control options are:
+
+
+ {@link #runJob(JobConf)} : submits the job and returns only after
+ the job has completed.
+
+
+ {@link #submitJob(JobConf)} : only submits the job, then poll the
+ returned handle to the {@link RunningJob} to query status and make
+ scheduling decisions.
+
+
+ {@link JobConf#setJobEndNotificationURI(String)} : setup a notification
+ on job-completion, thus avoiding polling.
+
For key-value pairs (K1,V1) and (K2,V2), the values (V1, V2) are passed
+ in a single call to the reduce function if K1 and K2 compare as equal.
+
+
Since {@link #setOutputKeyComparatorClass(Class)} can be used to control
+ how keys are sorted, this can be used in conjunction to simulate
+ secondary sort on values.
+
+
Note: This is not a guarantee of the reduce sort being
+ stable in any sense. (In any case, with the order of available
+ map-outputs to the reduce being non-deterministic, it wouldn't make
+ that much sense.)
+
+ @param theClass the comparator class to be used for grouping keys.
+ It should implement RawComparator.
+ @see #setOutputKeyComparatorClass(Class)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ combiner class used to combine map-outputs
+ before being sent to the reducers. Typically the combiner is same as the
+ the {@link Reducer} for the job i.e. {@link #getReducerClass()}.
+
+ @return the user-defined combiner class used to combine map-outputs.]]>
+
+
+
+
+
+ combiner class used to combine map-outputs
+ before being sent to the reducers.
+
+
The combiner is a task-level aggregation operation which, in some cases,
+ helps to cut down the amount of data transferred from the {@link Mapper} to
+ the {@link Reducer}, leading to better performance.
+
+
Typically the combiner is same as the the Reducer for the
+ job i.e. {@link #setReducerClass(Class)}.
+
+ @param theClass the user-defined combiner class used to combine
+ map-outputs.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ true.
+
+ @return true if speculative execution be used for this job,
+ false otherwise.]]>
+
+
+
+
+
+ true if speculative execution
+ should be turned on, else false.]]>
+
+
+
+
+ true.
+
+ @return true if speculative execution be
+ used for this job for map tasks,
+ false otherwise.]]>
+
+
+
+
+
+ true if speculative execution
+ should be turned on for map tasks,
+ else false.]]>
+
+
+
+
+ true.
+
+ @return true if speculative execution be used
+ for reduce tasks for this job,
+ false otherwise.]]>
+
+
+
+
+
+ true if speculative execution
+ should be turned on for reduce tasks,
+ else false.]]>
+
+
+
+
+ 1.
+
+ @return the number of reduce tasks for this job.]]>
+
+
+
+
+
+ Note: This is only a hint to the framework. The actual
+ number of spawned map tasks depends on the number of {@link InputSplit}s
+ generated by the job's {@link InputFormat#getSplits(JobConf, int)}.
+
+ A custom {@link InputFormat} is typically used to accurately control
+ the number of map tasks for the job.
+
+
How many maps?
+
+
The number of maps is usually driven by the total size of the inputs
+ i.e. total number of blocks of the input files.
+
+
The right level of parallelism for maps seems to be around 10-100 maps
+ per-node, although it has been set up to 300 or so for very cpu-light map
+ tasks. Task setup takes awhile, so it is best if the maps take at least a
+ minute to execute.
+
+
The default behavior of file-based {@link InputFormat}s is to split the
+ input into logical {@link InputSplit}s based on the total size, in
+ bytes, of input files. However, the {@link FileSystem} blocksize of the
+ input files is treated as an upper bound for input splits. A lower bound
+ on the split size can be set via
+
+ mapred.min.split.size.
+
+
Thus, if you expect 10TB of input data and have a blocksize of 128MB,
+ you'll end up with 82,000 maps, unless {@link #setNumMapTasks(int)} is
+ used to set it even higher.
+
+ @param n the number of map tasks for this job.
+ @see InputFormat#getSplits(JobConf, int)
+ @see FileInputFormat
+ @see FileSystem#getDefaultBlockSize()
+ @see FileStatus#getBlockSize()]]>
+
+
+
+
+ 1.
+
+ @return the number of reduce tasks for this job.]]>
+
+
+
+
+
+ How many reduces?
+
+
With 0.95 all of the reduces can launch immediately and
+ start transfering map outputs as the maps finish. With 1.75
+ the faster nodes will finish their first round of reduces and launch a
+ second wave of reduces doing a much better job of load balancing.
+
+
Increasing the number of reduces increases the framework overhead, but
+ increases load balancing and lowers the cost of failures.
+
+
The scaling factors above are slightly less than whole numbers to
+ reserve a few reduce slots in the framework for speculative-tasks, failures
+ etc.
+
+
Reducer NONE
+
+
It is legal to set the number of reduce-tasks to zero.
+
+
In this case the output of the map-tasks directly go to distributed
+ file-system, to the path set by
+ {@link FileOutputFormat#setOutputPath(JobConf, Path)}. Also, the
+ framework doesn't sort the map-outputs before writing it out to HDFS.
+
+ @param n the number of reduce tasks for this job.]]>
+
+
+
+
+ mapred.map.max.attempts
+ property. If this property is not already set, the default is 4 attempts.
+
+ @return the max number of attempts per map task.]]>
+
+
+
+
+
+
+
+
+
+
+ mapred.reduce.max.attempts
+ property. If this property is not already set, the default is 4 attempts.
+
+ @return the max number of attempts per reduce task.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ noFailures, the
+ tasktracker is blacklisted for this job.
+
+ @param noFailures maximum no. of failures of a given job per tasktracker.]]>
+
+
+
+
+ blacklisted for this job.
+
+ @return the maximum no. of failures of a given job per tasktracker.]]>
+
+
+
+
+ failed.
+
+ Defaults to zero, i.e. any failed map-task results in
+ the job being declared as {@link JobStatus#FAILED}.
+
+ @return the maximum percentage of map tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+
+ failed.
+
+ @param percent the maximum percentage of map tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+ failed.
+
+ Defaults to zero, i.e. any failed reduce-task results
+ in the job being declared as {@link JobStatus#FAILED}.
+
+ @return the maximum percentage of reduce tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+
+ failed.
+
+ @param percent the maximum percentage of reduce tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The debug script can aid debugging of failed map tasks. The script is
+ given task's stdout, stderr, syslog, jobconf files as arguments.
+
+
The debug command, run on the node where the map failed, is:
+
+ $script $stdout $stderr $syslog $jobconf.
+
+
+
The script file is distributed through {@link DistributedCache}
+ APIs. The script needs to be symlinked.
+
+ @param mDbgScript the script name]]>
+
+
+
+
+
+
+
+
+
+
+ The debug script can aid debugging of failed reduce tasks. The script
+ is given task's stdout, stderr, syslog, jobconf files as arguments.
+
+
The debug command, run on the node where the map failed, is:
+
+ $script $stdout $stderr $syslog $jobconf.
+
+
+
The script file is distributed through {@link DistributedCache}
+ APIs. The script file needs to be symlinked
+
+ @param rDbgScript the script name]]>
+
+
+
+
+
+
+
+
+
+ null if it hasn't
+ been set.
+ @see #setJobEndNotificationURI(String)]]>
+
+
+
+
+
+ The uri can contain 2 special parameters: $jobId and
+ $jobStatus. Those, if present, are replaced by the job's
+ identifier and completion-status respectively.
+
+
This is typically used by application-writers to implement chaining of
+ Map-Reduce jobs in an asynchronous manner.
+
+ @param uri the job end notification uri
+ @see JobStatus
+ @see Job Completion and Chaining]]>
+
+
+
+
+
+ When a job starts, a shared directory is created at location
+
+ ${mapred.local.dir}/taskTracker/jobcache/$jobid/work/ .
+ This directory is exposed to the users through
+ job.local.dir .
+ So, the tasks can use this space
+ as scratch space and share files among them.
+ This value is available as System property also.
+
+ @return The localized job specific shared directory]]>
+
+
+
+ JobConf is the primary interface for a user to describe a
+ map-reduce job to the Hadoop framework for execution. The framework tries to
+ faithfully execute the job as-is described by JobConf, however:
+
+
+ Some configuration parameters might have been marked as
+
+ final by administrators and hence cannot be altered.
+
+
+ While some job parameters are straight-forward to set
+ (e.g. {@link #setNumReduceTasks(int)}), some parameters interact subtly
+ rest of the framework and/or job-configuration and is relatively more
+ complex for the user to control finely (e.g. {@link #setNumMapTasks(int)}).
+
+
+
+
JobConf typically specifies the {@link Mapper}, combiner
+ (if any), {@link Partitioner}, {@link Reducer}, {@link InputFormat} and
+ {@link OutputFormat} implementations to be used etc.
+
+
Optionally JobConf is used to specify other advanced facets
+ of the job such as Comparators to be used, files to be put in
+ the {@link DistributedCache}, whether or not intermediate and/or job outputs
+ are to be compressed (and how), debugability via user-provided scripts
+ ( {@link #setMapDebugScript(String)}/{@link #setReduceDebugScript(String)}),
+ for doing post-processing on task logs, task's stdout, stderr, syslog.
+ and etc.
+
+
Here is an example on how to configure a job via JobConf:
+
+ // Create a new JobConf
+ JobConf job = new JobConf(new Configuration(), MyJob.class);
+
+ // Specify various job-specific parameters
+ job.setJobName("myjob");
+
+ FileInputFormat.setInputPaths(job, new Path("in"));
+ FileOutputFormat.setOutputPath(job, new Path("out"));
+
+ job.setMapperClass(MyJob.MyMapper.class);
+ job.setCombinerClass(MyJob.MyReducer.class);
+ job.setReducerClass(MyJob.MyReducer.class);
+
+ job.setInputFormat(SequenceFileInputFormat.class);
+ job.setOutputFormat(SequenceFileOutputFormat.class);
+
+ @param jtIdentifier jobTracker identifier, or null
+ @param jobId job number, or null
+ @return a regex pattern matching JobIDs]]>
+
+
+
+
+ An example JobID is :
+ job_200707121733_0003 , which represents the third job
+ running at the jobtracker started at 200707121733.
+
+ Applications should never construct or parse JobID strings, but rather
+ use appropriate constructors or {@link #forName(String)} method.
+
+ @see TaskID
+ @see TaskAttemptID
+ @see JobTracker#getNewJobId()
+ @see JobTracker#getStartTime()]]>
+
Applications can use the {@link Reporter} provided to report progress
+ or just indicate that they are alive. In scenarios where the application
+ takes an insignificant amount of time to process individual key/value
+ pairs, this is crucial since the framework might assume that the task has
+ timed-out and kill that task. The other way of avoiding this is to set
+
+ mapred.task.timeout to a high-enough value (or even zero for no
+ time-outs).
+
+ @param key the input key.
+ @param value the input value.
+ @param output collects mapped keys and values.
+ @param reporter facility to report progress.]]>
+
+
+
+ Maps are the individual tasks which transform input records into a
+ intermediate records. The transformed intermediate records need not be of
+ the same type as the input records. A given input pair may map to zero or
+ many output pairs.
+
+
The Hadoop Map-Reduce framework spawns one map task for each
+ {@link InputSplit} generated by the {@link InputFormat} for the job.
+ Mapper implementations can access the {@link JobConf} for the
+ job via the {@link JobConfigurable#configure(JobConf)} and initialize
+ themselves. Similarly they can use the {@link Closeable#close()} method for
+ de-initialization.
+
+
The framework then calls
+ {@link #map(Object, Object, OutputCollector, Reporter)}
+ for each key/value pair in the InputSplit for that task.
+
+
All intermediate values associated with a given output key are
+ subsequently grouped by the framework, and passed to a {@link Reducer} to
+ determine the final output. Users can control the grouping by specifying
+ a Comparator via
+ {@link JobConf#setOutputKeyComparatorClass(Class)}.
+
+
The grouped Mapper outputs are partitioned per
+ Reducer. Users can control which keys (and hence records) go to
+ which Reducer by implementing a custom {@link Partitioner}.
+
+
Users can optionally specify a combiner, via
+ {@link JobConf#setCombinerClass(Class)}, to perform local aggregation of the
+ intermediate outputs, which helps to cut down the amount of data transferred
+ from the Mapper to the Reducer.
+
+
The intermediate, grouped outputs are always stored in
+ {@link SequenceFile}s. Applications can specify if and how the intermediate
+ outputs are to be compressed and which {@link CompressionCodec}s are to be
+ used via the JobConf.
+
+
If the job has
+ zero
+ reduces then the output of the Mapper is directly written
+ to the {@link FileSystem} without grouping by keys.
+
+
Example:
+
+ public class MyMapper<K extends WritableComparable, V extends Writable>
+ extends MapReduceBase implements Mapper<K, V, K, V> {
+
+ static enum MyCounters { NUM_RECORDS }
+
+ private String mapTaskId;
+ private String inputFile;
+ private int noRecords = 0;
+
+ public void configure(JobConf job) {
+ mapTaskId = job.get("mapred.task.id");
+ inputFile = job.get("mapred.input.file");
+ }
+
+ public void map(K key, V val,
+ OutputCollector<K, V> output, Reporter reporter)
+ throws IOException {
+ // Process the <key, value> pair (assume this takes a while)
+ // ...
+ // ...
+
+ // Let the framework know that we are alive, and kicking!
+ // reporter.progress();
+
+ // Process some more
+ // ...
+ // ...
+
+ // Increment the no. of <key, value> pairs processed
+ ++noRecords;
+
+ // Increment counters
+ reporter.incrCounter(NUM_RECORDS, 1);
+
+ // Every 100 records update application-level status
+ if ((noRecords%100) == 0) {
+ reporter.setStatus(mapTaskId + " processed " + noRecords +
+ " from input-file: " + inputFile);
+ }
+
+ // Output the result
+ output.collect(key, val);
+ }
+ }
+
+
+
Applications may write a custom {@link MapRunnable} to exert greater
+ control on map processing e.g. multi-threaded Mappers etc.
Mapping of input records to output records is complete when this method
+ returns.
+
+ @param input the {@link RecordReader} to read the input records.
+ @param output the {@link OutputCollector} to collect the outputrecords.
+ @param reporter {@link Reporter} to report progress, status-updates etc.
+ @throws IOException]]>
+
+
+
+ Custom implementations of MapRunnable can exert greater
+ control on map processing e.g. multi-threaded, asynchronous mappers etc.
+
+ @see Mapper]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ nearly
+ equal content length.
+ Subclasses implement {@link #getRecordReader(InputSplit, JobConf, Reporter)}
+ to construct RecordReader's for MultiFileSplit's.
+ @see MultiFileSplit]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ th Path]]>
+
+
+
+
+
+
+
+
+
+
+ th Path]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ MultiFileSplit can be used to implement {@link RecordReader}'s, with
+ reading one record per file.
+ @see FileSplit
+ @see MultiFileInputFormat]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ <key, value> pairs output by {@link Mapper}s
+ and {@link Reducer}s.
+
+
OutputCollector is the generalization of the facility
+ provided by the Map-Reduce framework to collect data output by either the
+ Mapper or the Reducer i.e. intermediate outputs
+ or the output of the job.
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This is to validate the output specification for the job when it is
+ a job is submitted. Typically checks that it does not already exist,
+ throwing an exception when it already exists, so that output is not
+ overwritten.
+
+ @param ignored
+ @param job job configuration.
+ @throws IOException when output should not be attempted]]>
+
+
+
+ OutputFormat describes the output-specification for a
+ Map-Reduce job.
+
+
The Map-Reduce framework relies on the OutputFormat of the
+ job to:
+
+
+ Validate the output-specification of the job. For e.g. check that the
+ output directory doesn't already exist.
+
+ Provide the {@link RecordWriter} implementation to be used to write out
+ the output files of the job. Output files are stored in a
+ {@link FileSystem}.
+
+
+
+ @see RecordWriter
+ @see JobConf]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the job output should be compressed,
+ false otherwise]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Typically a hash function on a all or a subset of the key.
+
+ @param key the key to be paritioned.
+ @param value the entry value.
+ @param numPartitions the total number of partitions.
+ @return the partition number for the key.]]>
+
+
+
+ Partitioner controls the partitioning of the keys of the
+ intermediate map-outputs. The key (or a subset of the key) is used to derive
+ the partition, typically by a hash function. The total number of partitions
+ is the same as the number of reduce tasks for the job. Hence this controls
+ which of the m reduce tasks the intermediate key (and hence the
+ record) is sent for reduction.
+
+ @see Reducer]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ 0.0 to 1.0.
+ @throws IOException]]>
+
+
+
+ RecordReader reads <key, value> pairs from an
+ {@link InputSplit}.
+
+
RecordReader, typically, converts the byte-oriented view of
+ the input, provided by the InputSplit, and presents a
+ record-oriented view for the {@link Mapper} & {@link Reducer} tasks for
+ processing. It thus assumes the responsibility of processing record
+ boundaries and presenting the tasks with keys and values.
RecordWriter implementations write the job outputs to the
+ {@link FileSystem}.
+
+ @see OutputFormat]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Reduces values for a given key.
+
+
The framework calls this method for each
+ <key, (list of values)> pair in the grouped inputs.
+ Output values must be of the same type as input values. Input keys must
+ not be altered. The framework will reuse the key and value objects
+ that are passed into the reduce, therefore the application should clone
+ the objects they want to keep a copy of. In many cases, all values are
+ combined into zero or one value.
+
+
+
Output pairs are collected with calls to
+ {@link OutputCollector#collect(Object,Object)}.
+
+
Applications can use the {@link Reporter} provided to report progress
+ or just indicate that they are alive. In scenarios where the application
+ takes an insignificant amount of time to process individual key/value
+ pairs, this is crucial since the framework might assume that the task has
+ timed-out and kill that task. The other way of avoiding this is to set
+
+ mapred.task.timeout to a high-enough value (or even zero for no
+ time-outs).
+
+ @param key the key.
+ @param values the list of values to reduce.
+ @param output to collect keys and combined values.
+ @param reporter facility to report progress.]]>
+
+
+
+ The number of Reducers for the job is set by the user via
+ {@link JobConf#setNumReduceTasks(int)}. Reducer implementations
+ can access the {@link JobConf} for the job via the
+ {@link JobConfigurable#configure(JobConf)} method and initialize themselves.
+ Similarly they can use the {@link Closeable#close()} method for
+ de-initialization.
+
+
Reducer has 3 primary phases:
+
+
+
+
Shuffle
+
+
Reducer is input the grouped output of a {@link Mapper}.
+ In the phase the framework, for each Reducer, fetches the
+ relevant partition of the output of all the Mappers, via HTTP.
+
+
+
+
+
Sort
+
+
The framework groups Reducer inputs by keys
+ (since different Mappers may have output the same key) in this
+ stage.
+
+
The shuffle and sort phases occur simultaneously i.e. while outputs are
+ being fetched they are merged.
+
+
SecondarySort
+
+
If equivalence rules for keys while grouping the intermediates are
+ different from those for grouping keys before reduction, then one may
+ specify a Comparator via
+ {@link JobConf#setOutputValueGroupingComparator(Class)}.Since
+ {@link JobConf#setOutputKeyComparatorClass(Class)} can be used to
+ control how intermediate keys are grouped, these can be used in conjunction
+ to simulate secondary sort on values.
+
+
+ For example, say that you want to find duplicate web pages and tag them
+ all with the url of the "best" known example. You would set up the job
+ like:
+
+
Map Input Key: url
+
Map Input Value: document
+
Map Output Key: document checksum, url pagerank
+
Map Output Value: url
+
Partitioner: by checksum
+
OutputKeyComparator: by checksum and then decreasing pagerank
+
OutputValueGroupingComparator: by checksum
+
+
+
+
+
Reduce
+
+
In this phase the
+ {@link #reduce(Object, Iterator, OutputCollector, Reporter)}
+ method is called for each <key, (list of values)> pair in
+ the grouped inputs.
+
The output of the reduce task is typically written to the
+ {@link FileSystem} via
+ {@link OutputCollector#collect(Object, Object)}.
+
+
+
+
The output of the Reducer is not re-sorted.
+
+
Example:
+
+ public class MyReducer<K extends WritableComparable, V extends Writable>
+ extends MapReduceBase implements Reducer<K, V, K, V> {
+
+ static enum MyCounters { NUM_RECORDS }
+
+ private String reduceTaskId;
+ private int noKeys = 0;
+
+ public void configure(JobConf job) {
+ reduceTaskId = job.get("mapred.task.id");
+ }
+
+ public void reduce(K key, Iterator<V> values,
+ OutputCollector<K, V> output,
+ Reporter reporter)
+ throws IOException {
+
+ // Process
+ int noValues = 0;
+ while (values.hasNext()) {
+ V value = values.next();
+
+ // Increment the no. of values for this key
+ ++noValues;
+
+ // Process the <key, value> pair (assume this takes a while)
+ // ...
+ // ...
+
+ // Let the framework know that we are alive, and kicking!
+ if ((noValues%10) == 0) {
+ reporter.progress();
+ }
+
+ // Process some more
+ // ...
+ // ...
+
+ // Output the <key, value>
+ output.collect(key, value);
+ }
+
+ // Increment the no. of <key, list of values> pairs processed
+ ++noKeys;
+
+ // Increment counters
+ reporter.incrCounter(NUM_RECORDS, 1);
+
+ // Every 100 keys update application-level status
+ if ((noKeys%100) == 0) {
+ reporter.setStatus(reduceTaskId + " processed " + noKeys);
+ }
+ }
+ }
+
+
+ @see Mapper
+ @see Partitioner
+ @see Reporter
+ @see MapReduceBase]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Enum.
+ @param amount A non-negative amount by which the counter is to
+ be incremented.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ InputSplit that the map is reading from.
+ @throws UnsupportedOperationException if called outside a mapper]]>
+
+
+
+
+
+
+
+
+ {@link Mapper} and {@link Reducer} can use the Reporter
+ provided to report progress or just indicate that they are alive. In
+ scenarios where the application takes an insignificant amount of time to
+ process individual key/value pairs, this is crucial since the framework
+ might assume that the task has timed-out and kill that task.
+
+
Applications can also update {@link Counters} via the provided
+ Reporter .
+
+ @see Progressable
+ @see Counters]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ progress of the job's map-tasks, as a float between 0.0
+ and 1.0. When all map tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's map-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ progress of the job's reduce-tasks, as a float between 0.0
+ and 1.0. When all reduce tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's reduce-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ true if the job is complete, else false.
+ @throws IOException]]>
+
+
+
+
+
+ true if the job succeeded, else false.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ RunningJob is the user-interface to query for details on a
+ running Map-Reduce job.
+
+
Clients can get hold of RunningJob via the {@link JobClient}
+ and then query the running-job for details such as name, configuration,
+ progress etc.
+
+ @see JobClient]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This allows the user to specify the key class to be different
+ from the actual class ({@link BytesWritable}) used for writing
+
+ @param conf the {@link JobConf} to modify
+ @param theClass the SequenceFile output key class.]]>
+
+
+
+
+
+
+ This allows the user to specify the value class to be different
+ from the actual class ({@link BytesWritable}) used for writing
+
+ @param conf the {@link JobConf} to modify
+ @param theClass the SequenceFile output key class.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ f. The filtering criteria is
+ MD5(key) % f == 0.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ f using
+ the criteria record# % f == 0.
+ For example, if the frequency is 10, one out of 10 records is returned.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ .
+ @param name The name of the server
+ @param port The port to use on the server
+ @param findPort whether the server should start at the given port and
+ increment by 1 until it finds a free port.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ points to the log directory
+ "/static/" -> points to common static files (src/webapps/static)
+ "/" -> the jsp server code from (src/webapps/)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ all task attempt IDs
+ of any jobtracker, in any job, of the first
+ map task, we would use :
+
+ @param jtIdentifier jobTracker identifier, or null
+ @param jobId job number, or null
+ @param isMap whether the tip is a map, or null
+ @param taskId taskId number, or null
+ @param attemptId the task attempt number, or null
+ @return a regex pattern matching TaskAttemptIDs]]>
+
+
+
+
+ An example TaskAttemptID is :
+ attempt_200707121733_0003_m_000005_0 , which represents the
+ zeroth task attempt for the fifth map task in the third job
+ running at the jobtracker started at 200707121733.
+
+ Applications should never construct or parse TaskAttemptID strings
+ , but rather use appropriate constructors or {@link #forName(String)}
+ method.
+
+ @see JobID
+ @see TaskID]]>
+
+ @param jtIdentifier jobTracker identifier, or null
+ @param jobId job number, or null
+ @param isMap whether the tip is a map, or null
+ @param taskId taskId number, or null
+ @return a regex pattern matching TaskIDs]]>
+
+
+
+
+ An example TaskID is :
+ task_200707121733_0003_m_000005 , which represents the
+ fifth map task in the third job running at the jobtracker
+ started at 200707121733.
+
+ Applications should never construct or parse TaskID strings
+ , but rather use appropriate constructors or {@link #forName(String)}
+ method.
+
+ @see JobID
+ @see TaskAttemptID]]>
+
+ Map implementations using this MapRunnable must be thread-safe.
+
+ The Map-Reduce job has to be configured to use this MapRunnable class (using
+ the JobConf.setMapRunnerClass method) and
+ the number of thread the thread-pool can use with the
+ mapred.map.multithreadedrunner.threads property, its default
+ value is 10 threads.
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ pairs. Uses
+ {@link StringTokenizer} to break text into tokens.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ generateKeyValPairs(Object key, Object value); public void
+ configure(JobConfjob); }
+
+ The package also provides a base class, ValueAggregatorBaseDescriptor,
+ implementing the above interface. The user can extend the base class and
+ implement generateKeyValPairs accordingly.
+
+ The primary work of generateKeyValPairs is to emit one or more key/value
+ pairs based on the input key/value pair. The key in an output key/value pair
+ encode two pieces of information: aggregation type and aggregation id. The
+ value will be aggregated onto the aggregation id according the aggregation
+ type.
+
+ This class offers a function to generate a map/reduce job using Aggregate
+ framework. The function takes the following parameters: input directory spec
+ input format (text or sequence file) output directory a file specifying the
+ user plugin class]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ When constructing the instance, if the factory property
+ contextName.class exists,
+ its value is taken to be the name of the class to instantiate. Otherwise,
+ the default is to create an instance of
+ org.apache.hadoop.metrics.spi.NullContext, which is a
+ dummy "no-op" context which will cause all metric data to be discarded.
+
+ @param contextName the name of the context
+ @return the named MetricsContext]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ When the instance is constructed, this method checks if the file
+ hadoop-metrics.properties exists on the class path. If it
+ exists, it must be in the format defined by java.util.Properties, and all
+ the properties in the file are set as attributes on the newly created
+ ContextFactory instance.
+
+ @return the singleton ContextFactory instance]]>
+
+
+
+ getFactory() method.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ startMonitoring() again after calling
+ this.
+ @see #close()]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ recordName.
+ Throws an exception if the metrics implementation is configured with a fixed
+ set of record names and recordName is not in that set.
+
+ @param recordName the name of the record
+ @throws MetricsException if recordName conflicts with configuration data]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A record name identifies the kind of data to be reported. For example, a
+ program reporting statistics relating to the disks on a computer might use
+ a record name "diskStats".
+
+ A record has zero or more tags. A tag has a name and a value. To
+ continue the example, the "diskStats" record might use a tag named
+ "diskName" to identify a particular disk. Sometimes it is useful to have
+ more than one tag, so there might also be a "diskType" with value "ide" or
+ "scsi" or whatever.
+
+ A record also has zero or more metrics. These are the named
+ values that are to be reported to the metrics system. In the "diskStats"
+ example, possible metric names would be "diskPercentFull", "diskPercentBusy",
+ "kbReadPerSecond", etc.
+
+ The general procedure for using a MetricsRecord is to fill in its tag and
+ metric values, and then call update() to pass the record to the
+ client library.
+ Metric data is not immediately sent to the metrics system
+ each time that update() is called.
+ An internal table is maintained, identified by the record name. This
+ table has columns
+ corresponding to the tag and the metric names, and rows
+ corresponding to each unique set of tag values. An update
+ either modifies an existing row in the table, or adds a new row with a set of
+ tag values that are different from all the other rows. Note that if there
+ are no tags, then there can be at most one row in the table.
+
+ Once a row is added to the table, its data will be sent to the metrics system
+ on every timer period, whether or not it has been updated since the previous
+ timer period. If this is inappropriate, for example if metrics were being
+ reported by some transient object in an application, the remove()
+ method can be used to remove the row and thus stop the data from being
+ sent.
+
+ Note that the update() method is atomic. This means that it is
+ safe for different threads to be updating the same metric. More precisely,
+ it is OK for different threads to call update() on MetricsRecord instances
+ with the same set of tag names and tag values. Different threads should
+ not use the same MetricsRecord instance at the same time.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ MetricsContext.registerUpdater().]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ fileName attribute,
+ if specified. Otherwise the data will be written to standard
+ output.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class is configured by setting ContextFactory attributes which in turn
+ are usually configured through a properties file. All the attributes are
+ prefixed by the contextName. For example, the properties file might contain:
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ contextName.tableName. The returned map consists of
+ those attributes with the contextName and tableName stripped off.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ recordName.
+ Throws an exception if the metrics implementation is configured with a fixed
+ set of record names and recordName is not in that set.
+
+ @param recordName the name of the record
+ @throws MetricsException if recordName conflicts with configuration data]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class implements the internal table of metric data, and the timer
+ on which data is to be sent to the metrics system. Subclasses must
+ override the abstract emitRecord method in order to transmit
+ the data. ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ update
+ and remove().]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ hostname or hostname:port. If
+ the specs string is null, defaults to localhost:defaultPort.
+
+ @return a list of InetSocketAddress objects.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ,name="
+ Where the and are the supplied parameters
+
+ @param serviceName
+ @param nameName
+ @param theMbean - the MBean to register
+ @return the named used to register the MBean]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ hadoop.rpc.socket.factory.class.<ClassName>. When no
+ such parameter exists then fall back on the default socket factory as
+ configured by hadoop.rpc.socket.factory.class.default. If
+ this default socket factory is not configured, then fall back on the JVM
+ default socket factory.
+
+ @param conf the configuration
+ @param clazz the class (usually a {@link VersionedProtocol})
+ @return a socket factory]]>
+
+
+
+
+
+ hadoop.rpc.socket.factory.default
+
+ @param conf the configuration
+ @return the default socket factory as specified in the configuration or
+ the JVM default socket factory if the configuration does not
+ contain a default socket factory property.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ :
+ ://:/]]>
+
+
+
+
+
+
+
+ :
+ ://:/]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ From documentation for {@link #getInputStream(Socket, long)}:
+ Returns InputStream for the socket. If the socket has an associated
+ SocketChannel then it returns a
+ {@link SocketInputStream} with the given timeout. If the socket does not
+ have a channel, {@link Socket#getInputStream()} is returned. In the later
+ case, the timeout argument is ignored and the timeout set with
+ {@link Socket#setSoTimeout(int)} applies for reads.
+
+ Any socket created using socket factories returned by {@link #NetUtils},
+ must use this interface instead of {@link Socket#getInputStream()}.
+
+ @see #getInputStream(Socket, long)
+
+ @param socket
+ @return InputStream for reading from the socket.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+ Any socket created using socket factories returned by {@link #NetUtils},
+ must use this interface instead of {@link Socket#getInputStream()}.
+
+ @see Socket#getChannel()
+
+ @param socket
+ @param timeout timeout in milliseconds. This may not always apply. zero
+ for waiting as long as necessary.
+ @return InputStream for reading from the socket.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+ From documentation for {@link #getOutputStream(Socket, long)} :
+ Returns OutputStream for the socket. If the socket has an associated
+ SocketChannel then it returns a
+ {@link SocketOutputStream} with the given timeout. If the socket does not
+ have a channel, {@link Socket#getOutputStream()} is returned. In the later
+ case, the timeout argument is ignored and the write will wait until
+ data is available.
The task requires the file or the nested fileset element to be
+ specified. Optional attributes are language (set the output
+ language, default is "java"),
+ destdir (name of the destination directory for generated java/c++
+ code, default is ".") and failonerror (specifies error handling
+ behavior. default is true).
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ GenericOptionsParser to parse only the generic Hadoop
+ arguments.
+
+ The array of string arguments other than the generic arguments can be
+ obtained by {@link #getRemainingArgs()}.
+
+ @param conf the Configuration to modify.
+ @param args command-line arguments.]]>
+
+
+
+
+ GenericOptionsParser to parse given options as well
+ as generic Hadoop options.
+
+ The resulting CommandLine object can be obtained by
+ {@link #getCommandLine()}.
+
+ @param conf the configuration to modify
+ @param options options built by the caller
+ @param args User-specified arguments]]>
+
+
+
+
+ Strings containing the un-parsed arguments
+ or empty array if commandLine was not defined.]]>
+
+
+
+
+ CommandLine object
+ to process the parsed arguments.
+
+ Note: If the object is created with
+ {@link #GenericOptionsParser(Configuration, String[])}, then returned
+ object will only contain parsed generic options.
+
+ @return CommandLine representing list of arguments
+ parsed against Options descriptor.]]>
+
+
+
+
+
+
+
+
+
+ GenericOptionsParser is a utility to parse command line
+ arguments generic to the Hadoop framework.
+
+ GenericOptionsParser recognizes several standarad command
+ line arguments, enabling applications to easily specify a namenode, a
+ jobtracker, additional configuration resources etc.
+
+
Generic Options
+
+
The supported generic options are:
+
+ -conf <configuration file> specify a configuration file
+ -D <property=value> use value for given property
+ -fs <local|namenode:port> specify a namenode
+ -jt <local|jobtracker:port> specify a job tracker
+ -files <comma separated list of files> specify comma separated
+ files to be copied to the map reduce cluster
+ -libjars <comma separated list of jars> specify comma separated
+ jar files to include in the classpath.
+ -archives <comma separated list of archives> specify comma
+ separated archives to be unarchived on the compute machines.
+
+
Generic command line arguments might modify
+ Configuration objects, given to constructors.
+
+
The functionality is implemented using Commons CLI.
+
+
Examples:
+
+ $ bin/hadoop dfs -fs darwin:8020 -ls /data
+ list /data directory in dfs with namenode darwin:8020
+
+ $ bin/hadoop dfs -D fs.default.name=darwin:8020 -ls /data
+ list /data directory in dfs with namenode darwin:8020
+
+ $ bin/hadoop dfs -conf hadoop-site.xml -ls /data
+ list /data directory in dfs with conf specified in hadoop-site.xml
+
+ $ bin/hadoop job -D mapred.job.tracker=darwin:50020 -submit job.xml
+ submit a job to job tracker darwin:50020
+
+ $ bin/hadoop job -jt darwin:50020 -submit job.xml
+ submit a job to job tracker darwin:50020
+
+ $ bin/hadoop job -jt local -submit job.xml
+ submit a job to local runner
+
+ $ bin/hadoop jar -libjars testlib.jar
+ -archives test.tgz -files file.txt inputjar args
+ job submission with libjars, files and archives
+
+
+ @see Tool
+ @see ToolRunner]]>
+
+
+
+
+
+
+
+
+
+
+ Class<T>) of the
+ argument of type T.
+ @param The type of the argument
+ @param t the object to get it class
+ @return Class<T>]]>
+
+
+
+
+
+
+ List<T> to a an array of
+ T[].
+ @param c the Class object of the items in the list
+ @param list the list to convert]]>
+
+
+
+
+
+ List<T> to a an array of
+ T[].
+ @param list the list to convert
+ @throws ArrayIndexOutOfBoundsException if the list is empty.
+ Use {@link #toArray(Class, List)} if the list may be empty.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if native-hadoop is loaded,
+ else false]]>
+
+
+
+
+
+ true if native hadoop libraries, if present, can be
+ used for this job; false otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ { pq.top().change(); pq.adjustTop(); }
+ instead of
+ { o = pq.pop(); o.change(); pq.push(o); }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Clients and/or applications can use the provided Progressable
+ to explicitly report progress to the Hadoop framework. This is especially
+ important for operations which take an insignificant amount of time since,
+ in-lieu of the reported progress, the framework has to assume that an error
+ has occured and time-out the operation.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Class is to be obtained
+ @return the correctly typed Class of the given object.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Hadoop Pipes
+ or Hadoop Streaming.
+
+ It also checks to ensure that we are running on a *nix platform else
+ (e.g. in Cygwin/Windows) it returns null.
+ @param job job configuration
+ @return a String[] with the ulimit command arguments or
+ null if we are running on a non *nix platform or
+ if the limit is unspecified.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Shell interface.
+ @param cmd shell command to execute.
+ @return the output of the executed command.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Shell can be used to run unix commands like du or
+ df. It also offers facilities to gate commands by
+ time-intervals.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ShellCommandExecutorshould be used in cases where the output
+ of the command needs no explicit parsing and where the command, working
+ directory and the environment remains unchanged. The output of the command
+ is stored as-is and is expected to be small.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ArrayList of string values]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ charToEscape in the string
+ with the escape char escapeChar
+
+ @param str string
+ @param escapeChar escape char
+ @param charToEscape the char to be escaped
+ @return an escaped string]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ charToEscape in the string
+ with the escape char escapeChar
+
+ @param str string
+ @param escapeChar escape char
+ @param charToEscape the escaped char
+ @return an unescaped string]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Tool, is the standard for any Map-Reduce tool/application.
+ The tool/application should delegate the handling of
+
+ standard command-line options to {@link ToolRunner#run(Tool, String[])}
+ and only handle its custom arguments.
+
+
Here is how a typical Tool is implemented:
+
+ public class MyApp extends Configured implements Tool {
+
+ public int run(String[] args) throws Exception {
+ // Configuration processed by ToolRunner
+ Configuration conf = getConf();
+
+ // Create a JobConf using the processed conf
+ JobConf job = new JobConf(conf, MyApp.class);
+
+ // Process custom command-line options
+ Path in = new Path(args[1]);
+ Path out = new Path(args[2]);
+
+ // Specify various job-specific parameters
+ job.setJobName("my-app");
+ job.setInputPath(in);
+ job.setOutputPath(out);
+ job.setMapperClass(MyApp.MyMapper.class);
+ job.setReducerClass(MyApp.MyReducer.class);
+
+ // Submit the job, then poll for progress until the job is complete
+ JobClient.runJob(job);
+ }
+
+ public static void main(String[] args) throws Exception {
+ // Let ToolRunner handle generic command-line options
+ int res = ToolRunner.run(new Configuration(), new Sort(), args);
+
+ System.exit(res);
+ }
+ }
+
+
+ @see GenericOptionsParser
+ @see ToolRunner]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Tool by {@link Tool#run(String[])}, after
+ parsing with the given generic arguments. Uses the given
+ Configuration, or builds one if null.
+
+ Sets the Tool's configuration with the possibly modified
+ version of the conf.
+
+ @param conf Configuration for the Tool.
+ @param tool Tool to run.
+ @param args command-line arguments to the tool.
+ @return exit code of the {@link Tool#run(String[])} method.]]>
+
+
+
+
+
+
+
+ Tool with its Configuration.
+
+ Equivalent to run(tool.getConf(), tool, args).
+
+ @param tool Tool to run.
+ @param args command-line arguments to the tool.
+ @return exit code of the {@link Tool#run(String[])} method.]]>
+
+
+
+
+
+
+
+
+
+ ToolRunner can be used to run classes implementing
+ Tool interface. It works in conjunction with
+ {@link GenericOptionsParser} to parse the
+
+ generic hadoop command line arguments and modifies the
+ Configuration of the Tool. The
+ application-specific options are passed along without being modified.
+
+
+ @see Tool
+ @see GenericOptionsParser]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/x86_64/share/hadoop/common/jdiff/hadoop_0.19.0.xml b/x86_64/share/hadoop/common/jdiff/hadoop_0.19.0.xml
new file mode 100644
index 0000000..557ac3c
--- /dev/null
+++ b/x86_64/share/hadoop/common/jdiff/hadoop_0.19.0.xml
@@ -0,0 +1,43972 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ final.
+
+ @param name resource to be added, the classpath is examined for a file
+ with that name.]]>
+
+
+
+
+
+ final.
+
+ @param url url of the resource to be added, the local filesystem is
+ examined directly to find the resource, without referring to
+ the classpath.]]>
+
+
+
+
+
+ final.
+
+ @param file file-path of resource to be added, the local filesystem is
+ examined directly to find the resource, without referring to
+ the classpath.]]>
+
+
+
+
+
+ final.
+
+ @param in InputStream to deserialize the object from.]]>
+
+
+
+
+
+
+
+
+
+
+ name property, null if
+ no such property exists.
+
+ Values are processed for variable expansion
+ before being returned.
+
+ @param name the property name.
+ @return the value of the name property,
+ or null if no such property exists.]]>
+
+
+
+
+
+ name property, without doing
+ variable expansion.
+
+ @param name the property name.
+ @return the value of the name property,
+ or null if no such property exists.]]>
+
+
+
+
+
+
+ value of the name property.
+
+ @param name property name.
+ @param value property value.]]>
+
+
+
+
+
+
+ name property. If no such property
+ exists, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value, or defaultValue if the property
+ doesn't exist.]]>
+
+
+
+
+
+
+ name property as an int.
+
+ If no such property exists, or if the specified value is not a valid
+ int, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as an int,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to an int.
+
+ @param name property name.
+ @param value int value of the property.]]>
+
+
+
+
+
+
+ name property as a long.
+ If no such property is specified, or if the specified value is not a valid
+ long, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as a long,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to a long.
+
+ @param name property name.
+ @param value long value of the property.]]>
+
+
+
+
+
+
+ name property as a float.
+ If no such property is specified, or if the specified value is not a valid
+ float, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as a float,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property as a boolean.
+ If no such property is specified, or if the specified value is not a valid
+ boolean, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as a boolean,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to a boolean.
+
+ @param name property name.
+ @param value boolean value of the property.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ name property as
+ a collection of Strings.
+ If no such property is specified then empty collection is returned.
+
+ This is an optimized version of {@link #getStrings(String)}
+
+ @param name property name.
+ @return property value as a collection of Strings.]]>
+
+
+
+
+
+ name property as
+ an array of Strings.
+ If no such property is specified then null is returned.
+
+ @param name property name.
+ @return property value as an array of Strings,
+ or null.]]>
+
+
+
+
+
+
+ name property as
+ an array of Strings.
+ If no such property is specified then default value is returned.
+
+ @param name property name.
+ @param defaultValue The default value
+ @return property value as an array of Strings,
+ or default value.]]>
+
+
+
+
+
+
+ name property as
+ as comma delimited values.
+
+ @param name property name.
+ @param values The values]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ name property
+ as an array of Class.
+ The value of the property specifies a list of comma separated class names.
+ If no such property is specified, then defaultValue is
+ returned.
+
+ @param name the property name.
+ @param defaultValue default value.
+ @return property value as a Class[],
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property as a Class.
+ If no such property is specified, then defaultValue is
+ returned.
+
+ @param name the class name.
+ @param defaultValue default value.
+ @return property value as a Class,
+ or defaultValue.]]>
+
+
+
+
+
+
+
+ name property as a Class
+ implementing the interface specified by xface.
+
+ If no such property is specified, then defaultValue is
+ returned.
+
+ An exception is thrown if the returned class does not implement the named
+ interface.
+
+ @param name the class name.
+ @param defaultValue default value.
+ @param xface the interface implemented by the named class.
+ @return property value as a Class,
+ or defaultValue.]]>
+
+
+
+
+
+
+
+ name property to the name of a
+ theClass implementing the given interface xface.
+
+ An exception is thrown if theClass does not implement the
+ interface xface.
+
+ @param name property name.
+ @param theClass property value.
+ @param xface the interface implemented by the named class.]]>
+
+
+
+
+
+
+
+ dirsProp with
+ the given path. If dirsProp contains multiple directories,
+ then one is chosen based on path's hash code. If the selected
+ directory does not exist, an attempt is made to create it.
+
+ @param dirsProp directory in which to locate the file.
+ @param path file-path.
+ @return local file under the directory with the given path.]]>
+
+
+
+
+
+
+
+ dirsProp with
+ the given path. If dirsProp contains multiple directories,
+ then one is chosen based on path's hash code. If the selected
+ directory does not exist, an attempt is made to create it.
+
+ @param dirsProp directory in which to locate the file.
+ @param path file-path.
+ @return local file under the directory with the given path.]]>
+
+
+
+
+
+
+
+
+
+
+
+ name.
+
+ @param name configuration resource name.
+ @return an input stream attached to the resource.]]>
+
+
+
+
+
+ name.
+
+ @param name configuration resource name.
+ @return a reader attached to the resource.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ String
+ key-value pairs in the configuration.
+
+ @return an iterator over the entries.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true to set quiet-mode on, false
+ to turn it off.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Resources
+
+
Configurations are specified by resources. A resource contains a set of
+ name/value pairs as XML data. Each resource is named by either a
+ String or by a {@link Path}. If named by a String,
+ then the classpath is examined for a file with that name. If named by a
+ Path, then the local filesystem is examined directly, without
+ referring to the classpath.
+
+
Unless explicitly turned off, Hadoop by default specifies two
+ resources, loaded in-order from the classpath:
hadoop-site.xml: Site-specific configuration for a given hadoop
+ installation.
+
+ Applications may add additional resources, which are loaded
+ subsequent to these resources in the order they are added.
+
+
Final Parameters
+
+
Configuration parameters may be declared final.
+ Once a resource declares a value final, no subsequently-loaded
+ resource can alter that value.
+ For example, one might define a final parameter with:
+
+
+ When conf.get("tempdir") is called, then ${basedir}
+ will be resolved to another property in this Configuration, while
+ ${user.name} would then ordinarily be resolved to the value
+ of the System property with that name.]]>
+
Applications specify the files, via urls (hdfs:// or http://) to be cached
+ via the {@link org.apache.hadoop.mapred.JobConf}.
+ The DistributedCache assumes that the
+ files specified via hdfs:// urls are already present on the
+ {@link FileSystem} at the path specified by the url.
+
+
The framework will copy the necessary files on to the slave node before
+ any tasks for the job are executed on that node. Its efficiency stems from
+ the fact that the files are only copied once per job and the ability to
+ cache archives which are un-archived on the slaves.
+
+
DistributedCache can be used to distribute simple, read-only
+ data/text files and/or more complex types such as archives, jars etc.
+ Archives (zip, tar and tgz/tar.gz files) are un-archived at the slave nodes.
+ Jars may be optionally added to the classpath of the tasks, a rudimentary
+ software distribution mechanism. Files have execution permissions.
+ Optionally users can also direct it to symlink the distributed cache file(s)
+ into the working directory of the task.
+
+
DistributedCache tracks modification timestamps of the cache
+ files. Clearly the cache files should not be modified by the application
+ or externally while the job is executing.
+
+
Here is an illustrative example on how to use the
+ DistributedCache:
+
+ // Setting up the cache for the application
+
+ 1. Copy the requisite files to the FileSystem:
+
+ $ bin/hadoop fs -copyFromLocal lookup.dat /myapp/lookup.dat
+ $ bin/hadoop fs -copyFromLocal map.zip /myapp/map.zip
+ $ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar
+ $ bin/hadoop fs -copyFromLocal mytar.tar /myapp/mytar.tar
+ $ bin/hadoop fs -copyFromLocal mytgz.tgz /myapp/mytgz.tgz
+ $ bin/hadoop fs -copyFromLocal mytargz.tar.gz /myapp/mytargz.tar.gz
+
+ 2. Setup the application's JobConf:
+
+ JobConf job = new JobConf();
+ DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"),
+ job);
+ DistributedCache.addCacheArchive(new URI("/myapp/map.zip", job);
+ DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
+ DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar", job);
+ DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz", job);
+ DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz", job);
+
+ 3. Use the cached files in the {@link org.apache.hadoop.mapred.Mapper}
+ or {@link org.apache.hadoop.mapred.Reducer}:
+
+ public static class MapClass extends MapReduceBase
+ implements Mapper<K, V, K, V> {
+
+ private Path[] localArchives;
+ private Path[] localFiles;
+
+ public void configure(JobConf job) {
+ // Get the cached archives/files
+ localArchives = DistributedCache.getLocalCacheArchives(job);
+ localFiles = DistributedCache.getLocalCacheFiles(job);
+ }
+
+ public void map(K key, V value,
+ OutputCollector<K, V> output, Reporter reporter)
+ throws IOException {
+ // Use data from the cached archives/files here
+ // ...
+ // ...
+ output.collect(k, v);
+ }
+ }
+
+
+ A filename pattern is composed of regular characters and
+ special pattern matching characters, which are:
+
+
+
+
+
+
?
+
Matches any single character.
+
+
+
*
+
Matches zero or more characters.
+
+
+
[abc]
+
Matches a single character from character set
+ {a,b,c}.
+
+
+
[a-b]
+
Matches a single character from the character range
+ {a...b}. Note that character a must be
+ lexicographically less than or equal to character b.
+
+
+
[^a]
+
Matches a single character that is not from character set or range
+ {a}. Note that the ^ character must occur
+ immediately to the right of the opening bracket.
+
+
+
\c
+
Removes (escapes) any special meaning of character c.
+
+
+
{ab,cd}
+
Matches a string from the string set {ab, cd}
+
+
+
{ab,c{de,fh}}
+
Matches a string from the string set {ab, cde, cfh}
+
+
+
+
+
+ @param pathPattern a regular expression specifying a pth pattern
+
+ @return an array of paths that match the path pattern
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ All user code that may potentially use the Hadoop Distributed
+ File System should be written to use a FileSystem object. The
+ Hadoop DFS is a multi-machine system that appears as a single
+ disk. It's useful because of its fault tolerance and potentially
+ very large capacity.
+
+
+ The local implementation is {@link LocalFileSystem} and distributed
+ implementation is DistributedFileSystem.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FilterFileSystem contains
+ some other file system, which it uses as
+ its basic file system, possibly transforming
+ the data along the way or providing additional
+ functionality. The class FilterFileSystem
+ itself simply overrides all methods of
+ FileSystem with versions that
+ pass all requests to the contained file
+ system. Subclasses of FilterFileSystem
+ may further override some of these methods
+ and may also provide additional methods
+ and fields.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ buf at offset
+ and checksum into checksum.
+ The method is used for implementing read, therefore, it should be optimized
+ for sequential reading
+ @param pos chunkPos
+ @param buf desitination buffer
+ @param offset offset in buf at which to store data
+ @param len maximun number of bytes to read
+ @return number of bytes read]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ -1 if the end of the
+ stream is reached.
+ @exception IOException if an I/O error occurs.]]>
+
+
+
+
+
+
+
+
+ This method implements the general contract of the corresponding
+ {@link InputStream#read(byte[], int, int) read} method of
+ the {@link InputStream} class. As an additional
+ convenience, it attempts to read as many bytes as possible by repeatedly
+ invoking the read method of the underlying stream. This
+ iterated read continues until one of the following
+ conditions becomes true:
+
+
The specified number of bytes have been read,
+
+
The read method of the underlying stream returns
+ -1, indicating end-of-file.
+
+
If the first read on the underlying stream returns
+ -1 to indicate end-of-file then this method returns
+ -1. Otherwise this method returns the number of bytes
+ actually read.
+
+ @param b destination buffer.
+ @param off offset at which to start storing bytes.
+ @param len maximum number of bytes to read.
+ @return the number of bytes read, or -1 if the end of
+ the stream has been reached.
+ @exception IOException if an I/O error occurs.
+ ChecksumException if any checksum error occurs]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ n bytes of data from the
+ input stream.
+
+
This method may skip more bytes than are remaining in the backing
+ file. This produces no exception and the number of bytes skipped
+ may include some number of bytes that were beyond the EOF of the
+ backing file. Attempting to read from the stream after skipping past
+ the end will result in -1 indicating the end of the file.
+
+
If n is negative, no bytes are skipped.
+
+ @param n the number of bytes to be skipped.
+ @return the actual number of bytes skipped.
+ @exception IOException if an I/O error occurs.
+ ChecksumException if the chunk to skip to is corrupted]]>
+
+
+
+
+
+
+ This method may seek past the end of the file.
+ This produces no exception and an attempt to read from
+ the stream will result in -1 indicating the end of the file.
+
+ @param pos the postion to seek to.
+ @exception IOException if an I/O error occurs.
+ ChecksumException if the chunk to seek to is corrupted]]>
+
+
+
+
+
+
+
+
+
+ len bytes from
+ stm
+
+ @param stm an input stream
+ @param buf destiniation buffer
+ @param offset offset at which to store data
+ @param len number of bytes to read
+ @return actual number of bytes read
+ @throws IOException if there is any IO error]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ len bytes from the specified byte array
+ starting at offset off and generate a checksum for
+ each data chunk.
+
+
This method stores bytes from the given array into this
+ stream's buffer before it gets checksumed. The buffer gets checksumed
+ and flushed to the underlying output stream when all data
+ in a checksum chunk are in the buffer. If the buffer is empty and
+ requested length is at least as large as the size of next checksum chunk
+ size, this method will checksum and write the chunk directly
+ to the underlying output stream. Thus it avoids uneccessary data copy.
+
+ @param b the data.
+ @param off the start offset in the data.
+ @param len the number of bytes to write.
+ @exception IOException if an I/O error occurs.]]>
+
+
+ DataInputBuffer buffer = new DataInputBuffer();
+ while (... loop condition ...) {
+ byte[] data = ... get data ...;
+ int dataLength = ... get data length ...;
+ buffer.reset(data, dataLength);
+ ... read buffer using DataInput methods ...
+ }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This saves memory over creating a new DataOutputStream and
+ ByteArrayOutputStream each time data is written.
+
+
Typical usage is something like the following:
+
+ DataOutputBuffer buffer = new DataOutputBuffer();
+ while (... loop condition ...) {
+ buffer.reset();
+ ... write buffer using DataOutput methods ...
+ byte[] data = buffer.getData();
+ int dataLength = buffer.getLength();
+ ... write data to its ultimate destination ...
+ }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to store
+ @param item the object to be stored
+ @param keyName the name of the key to use
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param keyName the name of the key to use
+ @param itemClass the class of the item
+ @return restored object
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param items the objects to be stored
+ @param keyName the name of the key to use
+ @throws IndexOutOfBoundsException if the items array is empty
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param keyName the name of the key to use
+ @param itemClass the class of the item
+ @return restored object
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+ DefaultStringifier offers convenience methods to store/load objects to/from
+ the configuration.
+
+ @param the class of the objects to stringify]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a DoubleWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a FloatWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ When two sequence files, which have same Key type but different Value
+ types, are mapped out to reduce, multiple Value types is not allowed.
+ In this case, this class can help you wrap instances with different types.
+
+
+
+ Compared with ObjectWritable, this class is much more effective,
+ because ObjectWritable will append the class declaration as a String
+ into the output file in every Key-Value pair.
+
+
+
+ Generic Writable implements {@link Configurable} interface, so that it will be
+ configured by the framework. The configuration is passed to the wrapped objects
+ implementing {@link Configurable} interface before deserialization.
+
+
+ how to use it:
+ 1. Write your own class, such as GenericObject, which extends GenericWritable.
+ 2. Implements the abstract method getTypes(), defines
+ the classes which will be wrapped in GenericObject in application.
+ Attention: this classes defined in getTypes() method, must
+ implement Writable interface.
+
+
+ @since Nov 8, 2006]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This saves memory over creating a new InputStream and
+ ByteArrayInputStream each time data is read.
+
+
Typical usage is something like the following:
+
+ InputBuffer buffer = new InputBuffer();
+ while (... loop condition ...) {
+ byte[] data = ... get data ...;
+ int dataLength = ... get data length ...;
+ buffer.reset(data, dataLength);
+ ... read buffer using InputStream methods ...
+ }
+
+ @see DataInputBuffer
+ @see DataOutput]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a IntWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ closes the input and output streams
+ at the end.
+ @param in InputStrem to read from
+ @param out OutputStream to write to
+ @param conf the Configuration object]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ignore any {@link IOException} or
+ null pointers. Must only be used for cleanup in exception handlers.
+ @param log the log to record problems to at debug level. Can be null.
+ @param closeables the objects to close]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a LongWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A map is a directory containing two files, the data file,
+ containing all keys and values in the map, and a smaller index
+ file, containing a fraction of the keys. The fraction is determined by
+ {@link Writer#getIndexInterval()}.
+
+
The index file is read entirely into memory. Thus key implementations
+ should try to keep themselves small.
+
+
Map files are created by adding entries in-order. To maintain a large
+ database, perform updates by copying the previous version of a database and
+ merging in a sorted change list, to create a new version of the database in
+ a new file. Sorting large change lists can be done with {@link
+ SequenceFile.Sorter}.]]>
+
SequenceFile provides {@link Writer}, {@link Reader} and
+ {@link Sorter} classes for writing, reading and sorting respectively.
+
+ There are three SequenceFileWriters based on the
+ {@link CompressionType} used to compress key/value pairs:
+
+
+ Writer : Uncompressed records.
+
+
+ RecordCompressWriter : Record-compressed files, only compress
+ values.
+
+
+ BlockCompressWriter : Block-compressed files, both keys &
+ values are collected in 'blocks'
+ separately and compressed. The size of
+ the 'block' is configurable.
+
+
+
The actual compression algorithm used to compress key and/or values can be
+ specified by using the appropriate {@link CompressionCodec}.
+
+
The recommended way is to use the static createWriter methods
+ provided by the SequenceFile to chose the preferred format.
+
+
The {@link Reader} acts as the bridge and can read any of the above
+ SequenceFile formats.
+
+
SequenceFile Formats
+
+
Essentially there are 3 different formats for SequenceFiles
+ depending on the CompressionType specified. All of them share a
+ common header described below.
+
+
SequenceFile Header
+
+
+ version - 3 bytes of magic header SEQ, followed by 1 byte of actual
+ version number (e.g. SEQ4 or SEQ6)
+
+
+ keyClassName -key class
+
+
+ valueClassName - value class
+
+
+ compression - A boolean which specifies if compression is turned on for
+ keys/values in this file.
+
+
+ blockCompression - A boolean which specifies if block-compression is
+ turned on for keys/values in this file.
+
+
+ compression codec - CompressionCodec class which is used for
+ compression of keys and/or values (if compression is
+ enabled).
+
+
+ metadata - {@link Metadata} for this file.
+
+
+ sync - A sync marker to denote end of the header.
+
The compressed blocks of key lengths and value lengths consist of the
+ actual lengths of individual keys/values encoded in ZeroCompressedInteger
+ format.
+
+ @see CompressionCodec]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ key, skipping its
+ value. True if another entry exists, and false at end of file.]]>
+
+
+
+
+
+
+
+ key and
+ val. Returns true if such a pair exists and false when at
+ end of file]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The position passed must be a position returned by {@link
+ SequenceFile.Writer#getLength()} when writing this file. To seek to an arbitrary
+ position, use {@link SequenceFile.Reader#sync(long)}.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ SegmentDescriptor
+ @param segments the list of SegmentDescriptors
+ @param tmpDir the directory to write temporary files into
+ @return RawKeyValueIterator
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ For best performance, applications should make sure that the {@link
+ Writable#readFields(DataInput)} implementation of their keys is
+ very efficient. In particular, it should avoid allocating memory.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This always returns a synchronized position. In other words,
+ immediately after calling {@link SequenceFile.Reader#seek(long)} with a position
+ returned by this method, {@link SequenceFile.Reader#next(Writable)} may be called. However
+ the key may be earlier in the file than key last written when this
+ method was called (e.g., with block-compression, it may be the first key
+ in the block that was being written when this method was called).]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ key. Returns
+ true if such a key exists and false when at the end of the set.]]>
+
+
+
+
+
+
+ key.
+ Returns key, or null if no match exists.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the class of the objects to stringify]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ position. Note that this
+ method avoids using the converter or doing String instatiation
+ @return the Unicode scalar value at position or -1
+ if the position is invalid or points to a
+ trailing byte]]>
+
+
+
+
+
+
+
+
+
+ what in the backing
+ buffer, starting as position start. The starting
+ position is measured in bytes and the return value is in
+ terms of byte position in the buffer. The backing buffer is
+ not converted to a string for this operation.
+ @return byte position of the first occurence of the search
+ string in the UTF-8 buffer or -1 if not found]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a Text with the same contents.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ replace is true, then
+ malformed input is replaced with the
+ substitution character, which is U+FFFD. Otherwise the
+ method throws a MalformedInputException.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ replace is true, then
+ malformed input is replaced with the
+ substitution character, which is U+FFFD. Otherwise the
+ method throws a MalformedInputException.
+ @return ByteBuffer: bytes stores at ByteBuffer.array()
+ and length is ByteBuffer.limit()]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ In
+ addition, it provides methods for string traversal without converting the
+ byte array to a string.
Also includes utilities for
+ serializing/deserialing a string, coding/decoding a string, checking if a
+ byte array contains valid UTF8 code, calculating the length of an encoded
+ string.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a UTF8 with the same contents.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Also includes utilities for efficiently reading and writing UTF-8.
+
+ @deprecated replaced by Text]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This is useful when a class may evolve, so that instances written by the
+ old version of the class may still be processed by the new version. To
+ handle this situation, {@link #readFields(DataInput)}
+ implementations should catch {@link VersionMismatchException}.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a VIntWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a VLongWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ out.
+
+ @param out DataOuput to serialize this object into.
+ @throws IOException]]>
+
+
+
+
+
+
+ in.
+
+
For efficiency, implementations should attempt to re-use storage in the
+ existing object where possible.
+
+ @param in DataInput to deseriablize this object from.
+ @throws IOException]]>
+
+
+
+ Any key or value type in the Hadoop Map-Reduce
+ framework implements this interface.
+
+
Implementations typically implement a static read(DataInput)
+ method which constructs a new instance, calls {@link #readFields(DataInput)}
+ and returns the instance.
+
+
Example:
+
+ public class MyWritable implements Writable {
+ // Some data
+ private int counter;
+ private long timestamp;
+
+ public void write(DataOutput out) throws IOException {
+ out.writeInt(counter);
+ out.writeLong(timestamp);
+ }
+
+ public void readFields(DataInput in) throws IOException {
+ counter = in.readInt();
+ timestamp = in.readLong();
+ }
+
+ public static MyWritable read(DataInput in) throws IOException {
+ MyWritable w = new MyWritable();
+ w.readFields(in);
+ return w;
+ }
+ }
+
]]>
+
+
+
+
+
+
+
+
+ WritableComparables can be compared to each other, typically
+ via Comparators. Any type which is to be used as a
+ key in the Hadoop Map-Reduce framework should implement this
+ interface.
+
+
Example:
+
+ public class MyWritableComparable implements WritableComparable {
+ // Some data
+ private int counter;
+ private long timestamp;
+
+ public void write(DataOutput out) throws IOException {
+ out.writeInt(counter);
+ out.writeLong(timestamp);
+ }
+
+ public void readFields(DataInput in) throws IOException {
+ counter = in.readInt();
+ timestamp = in.readLong();
+ }
+
+ public int compareTo(MyWritableComparable w) {
+ int thisValue = this.value;
+ int thatValue = ((IntWritable)o).value;
+ return (thisValue < thatValue ? -1 : (thisValue==thatValue ? 0 : 1));
+ }
+ }
+
One may optimize compare-intensive operations by overriding
+ {@link #compare(byte[],int,int,byte[],int,int)}. Static utility methods are
+ provided to assist in optimized implementations of this method.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Enum type
+ @param in DataInput to read from
+ @param enumType Class type of Enum
+ @return Enum represented by String read from DataInput
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ len number of bytes in input streamin
+ @param in input stream
+ @param len number of bytes to skip
+ @throws IOException when skipped less number of bytes]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ CompressionCodec for which to get the
+ Compressor
+ @return Compressor for the given
+ CompressionCodec from the pool or a new one]]>
+
+
+
+
+
+ CompressionCodec for which to get the
+ Decompressor
+ @return Decompressor for the given
+ CompressionCodec the pool or a new one]]>
+
+
+
+
+
+ Compressor to be returned to the pool]]>
+
+
+
+
+
+ Decompressor to be returned to the
+ pool]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Implementations are assumed to be buffered. This permits clients to
+ reposition the underlying input stream then call {@link #resetState()},
+ without having to also synchronize client buffers.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true indicating that more input data is required.
+
+ @param b Input data
+ @param off Start offset
+ @param len Length]]>
+
+
+
+
+ true if the input data buffer is empty and
+ #setInput() should be called in order to provide more input.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the end of the compressed
+ data output stream has been reached.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true indicating that more input data is required.
+
+ @param b Input data
+ @param off Start offset
+ @param len Length]]>
+
+
+
+
+ true if the input data buffer is empty and
+ #setInput() should be called in order to provide more input.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ true if a preset dictionary is needed for decompression.
+ @return true if a preset dictionary is needed for decompression]]>
+
+
+
+
+ true if the end of the compressed
+ data output stream has been reached.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if native-lzo library is loaded & initialized;
+ else false]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ lzo compression/decompression pair.
+ http://www.oberhumer.com/opensource/lzo/]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ lzo compression/decompression pair compatible with lzop.
+ http://www.lzop.org/]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FIXME: This array should be in a private or package private location,
+ since it could be modified by malicious code.
+ ]]>
+
+
+
+
+ This interface is public for historical purposes. You should have no need to
+ use it.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+ Although BZip2 headers are marked with the magic "Bz" this
+ constructor expects the next byte in the stream to be the first one after
+ the magic. Thus callers have to skip the first two bytes. Otherwise this
+ constructor will throw an exception.
+
+
+ @throws IOException
+ if the stream content is malformed or an I/O error occurs.
+ @throws NullPointerException
+ if in == null]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The decompression requires large amounts of memory. Thus you should call the
+ {@link #close() close()} method as soon as possible, to force
+ CBZip2InputStream to release the allocated memory. See
+ {@link CBZip2OutputStream CBZip2OutputStream} for information about memory
+ usage.
+
+
+
+ CBZip2InputStream reads bytes from the compressed source stream via
+ the single byte {@link java.io.InputStream#read() read()} method exclusively.
+ Thus you should consider to use a buffered source stream.
+
+
+
+ Instances of this class are not threadsafe.
+
]]>
+
+
+
+
+
+
+
+
+
+ CBZip2OutputStream with a blocksize of 900k.
+
+
+ Attention: The caller is resonsible to write the two BZip2 magic
+ bytes "BZ" to the specified stream prior to calling this
+ constructor.
+
+
+ @param out *
+ the destination stream.
+
+ @throws IOException
+ if an I/O error occurs in the specified stream.
+ @throws NullPointerException
+ if out == null.]]>
+
+
+
+
+
+ CBZip2OutputStream with specified blocksize.
+
+
+ Attention: The caller is resonsible to write the two BZip2 magic
+ bytes "BZ" to the specified stream prior to calling this
+ constructor.
+
+
+
+ @param out
+ the destination stream.
+ @param blockSize
+ the blockSize as 100k units.
+
+ @throws IOException
+ if an I/O error occurs in the specified stream.
+ @throws IllegalArgumentException
+ if (blockSize < 1) || (blockSize > 9).
+ @throws NullPointerException
+ if out == null.
+
+ @see #MIN_BLOCKSIZE
+ @see #MAX_BLOCKSIZE]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ inputLength this method returns MAX_BLOCKSIZE
+ always.
+
+ @param inputLength
+ The length of the data which will be compressed by
+ CBZip2OutputStream.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ == 1.]]>
+
+
+
+
+ == 9.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ If you are ever unlucky/improbable enough to get a stack overflow whilst
+ sorting, increase the following constant and try again. In practice I
+ have never seen the stack go above 27 elems, so the following limit seems
+ very generous.
+ ]]>
+
+
+
+
+ The compression requires large amounts of memory. Thus you should call the
+ {@link #close() close()} method as soon as possible, to force
+ CBZip2OutputStream to release the allocated memory.
+
+
+
+ You can shrink the amount of allocated memory and maybe raise the compression
+ speed by choosing a lower blocksize, which in turn may cause a lower
+ compression ratio. You can avoid unnecessary memory allocation by avoiding
+ using a blocksize which is bigger than the size of the input.
+
+
+
+ You can compute the memory usage for compressing by the following formula:
+
+
+
+ <code>400k + (9 * blocksize)</code>.
+
+
+
+ To get the memory required for decompression by {@link CBZip2InputStream
+ CBZip2InputStream} use
+
+
+
+ <code>65k + (5 * blocksize)</code>.
+
+
+
+
+
+
+
Memory usage by blocksize
+
+
+
Blocksize
Compression
+ memory usage
Decompression
+ memory usage
+
+
+
100k
+
1300k
+
565k
+
+
+
200k
+
2200k
+
1065k
+
+
+
300k
+
3100k
+
1565k
+
+
+
400k
+
4000k
+
2065k
+
+
+
500k
+
4900k
+
2565k
+
+
+
600k
+
5800k
+
3065k
+
+
+
700k
+
6700k
+
3565k
+
+
+
800k
+
7600k
+
4065k
+
+
+
900k
+
8500k
+
4565k
+
+
+
+
+ For decompression CBZip2InputStream allocates less memory if the
+ bzipped input is smaller than one block.
+
+
+
+ Instances of this class are not threadsafe.
+
+
+
+ TODO: Update to BZip2 1.0.1
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if lzo compressors are loaded & initialized,
+ else false]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if lzo decompressors are loaded & initialized,
+ else false]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ @return the total (non-negative) number of uncompressed bytes input so far]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ @return the total (non-negative) number of uncompressed bytes input so far]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if native-zlib is loaded & initialized
+ and can be loaded for this job, else false]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Keep trying a limited number of times, waiting a fixed time between attempts,
+ and then fail by re-throwing the exception.
+ ]]>
+
+
+
+
+
+
+
+
+ Keep trying for a maximum time, waiting a fixed time between attempts,
+ and then fail by re-throwing the exception.
+ ]]>
+
+
+
+
+
+
+
+
+ Keep trying a limited number of times, waiting a growing amount of time between attempts,
+ and then fail by re-throwing the exception.
+ The time between attempts is sleepTime mutliplied by the number of tries so far.
+ ]]>
+
+
+
+
+
+
+
+
+ Keep trying a limited number of times, waiting a growing amount of time between attempts,
+ and then fail by re-throwing the exception.
+ The time between attempts is sleepTime mutliplied by a random
+ number in the range of [0, 2 to the number of retries)
+ ]]>
+
+
+
+
+
+
+
+ Set a default policy with some explicit handlers for specific exceptions.
+ ]]>
+
+
+
+
+
+
+
+ A retry policy for RemoteException
+ Set a default policy with some explicit handlers for specific exceptions.
+ ]]>
+
+
+
+
+
+ Try once, and fail by re-throwing the exception.
+ This corresponds to having no retry mechanism in place.
+ ]]>
+
+
+
+
+
+ Try once, and fail silently for void methods, or by
+ re-throwing the exception for non-void methods.
+ ]]>
+
+
+
+
+
+ Keep trying forever.
+ ]]>
+
+
+
+
+ A collection of useful implementations of {@link RetryPolicy}.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+ Determines whether the framework should retry a
+ method for the given exception, and the number
+ of retries that have been made for that operation
+ so far.
+
+ @param e The exception that caused the method to fail.
+ @param retries The number of times the method has been retried.
+ @return true if the method should be retried,
+ false if the method should not be retried
+ but shouldn't fail with an exception (only for void methods).
+ @throws Exception The re-thrown exception e indicating
+ that the method failed and should not be retried further.]]>
+
+
+
+
+ Specifies a policy for retrying method failures.
+ Implementations of this interface should be immutable.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Create a proxy for an interface of an implementation class
+ using the same retry policy for each method in the interface.
+
+ @param iface the interface that the retry will implement
+ @param implementation the instance whose methods should be retried
+ @param retryPolicy the policy for retirying method call failures
+ @return the retry proxy]]>
+
+
+
+
+
+
+
+
+ Create a proxy for an interface of an implementation class
+ using the a set of retry policies specified by method name.
+ If no retry policy is defined for a method then a default of
+ {@link RetryPolicies#TRY_ONCE_THEN_FAIL} is used.
+
+ @param iface the interface that the retry will implement
+ @param implementation the instance whose methods should be retried
+ @param methodNameToPolicyMap a map of method names to retry policies
+ @return the retry proxy]]>
+
+
+
+
+ A factory for creating retry proxies.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+ Prepare the deserializer for reading.]]>
+
+
+
+
+
+
+
+ Deserialize the next object from the underlying input stream.
+ If the object t is non-null then this deserializer
+ may set its internal state to the next object read from the input
+ stream. Otherwise, if the object t is null a new
+ deserialized object will be created.
+
+ @return the deserialized object]]>
+
+
+
+
+
+ Close the underlying input stream and clear up any resources.]]>
+
+
+
+
+ Provides a facility for deserializing objects of type from an
+ {@link InputStream}.
+
+
+
+ Deserializers are stateful, but must not buffer the input since
+ other producers may read from the input between calls to
+ {@link #deserialize(Object)}.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A {@link RawComparator} that uses a {@link Deserializer} to deserialize
+ the objects to be compared so that the standard {@link Comparator} can
+ be used to compare them.
+
+
+ One may optimize compare-intensive operations by using a custom
+ implementation of {@link RawComparator} that operates directly
+ on byte representations.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ An experimental {@link Serialization} for Java {@link Serializable} classes.
+
+ @see JavaSerializationComparator]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A {@link RawComparator} that uses a {@link JavaSerialization}
+ {@link Deserializer} to deserialize objects that are then compared via
+ their {@link Comparable} interfaces.
+
+ @param
+ @see JavaSerialization]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Encapsulates a {@link Serializer}/{@link Deserializer} pair.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+ Serializations are found by reading the io.serializations
+ property from conf, which is a comma-delimited list of
+ classnames.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A factory for {@link Serialization}s.
+ ]]>
+
+
+
+
+
+
+
+
+
+ Prepare the serializer for writing.]]>
+
+
+
+
+
+
+ Serialize t to the underlying output stream.]]>
+
+
+
+
+
+ Close the underlying output stream and clear up any resources.]]>
+
+
+
+
+ Provides a facility for serializing objects of type to an
+ {@link OutputStream}.
+
+
+
+ Serializers are stateful, but must not buffer the output since
+ other producers may write to the output between calls to
+ {@link #serialize(Object)}.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ param, to the IPC server running at
+ address, returning the value. Throws exceptions if there are
+ network problems or if the remote code threw an exception.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Unwraps any IOException.
+
+ @param lookupTypes the desired exception class.
+ @return IOException, which is either the lookupClass exception or this.]]>
+
+
+
+
+ This unwraps any Throwable that has a constructor taking
+ a String as a parameter.
+ Otherwise it returns this.
+
+ @return Throwable]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ protocol is a Java interface. All parameters and return types must
+ be one of:
+
+
a primitive type, boolean, byte,
+ char, short, int, long,
+ float, double, or void; or
+
+
a {@link String}; or
+
+
a {@link Writable}; or
+
+
an array of the above types
+
+ All methods in the protocol should throw only IOException. No field data of
+ the protocol instance is transmitted.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ handlerCount determines
+ the number of handler threads that will be used to process calls.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class has a number of metrics variables that are publicly accessible;
+ these variables (objects) have methods to update their values;
+ for example:
+
{@link #rpcQueueTime}.inc(time)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ For the statistics that are sampled and averaged, one must specify
+ a metrics context that does periodic update calls. Most do.
+ The default Null metrics context however does NOT. So if you aren't
+ using any other metrics context then you can turn on the viewing and averaging
+ of sampled metrics by specifying the following two lines
+ in the hadoop-meterics.properties file:
+
+ Note that the metrics are collected regardless of the context used.
+ The context with the update thread is used to average the data periodically]]>
+
Grouphandles localization of the class name and the
+ counter names.
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FileInputFormat implementations can override this and return
+ false to ensure that individual input files are never split-up
+ so that {@link Mapper}s process entire files.
+
+ @param fs the file system that the file is on
+ @param filename the file name to check
+ @return is this file splitable?]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FileInputFormat is the base class for all file-based
+ InputFormats. This provides a generic implementation of
+ {@link #getSplits(JobConf, int)}.
+ Subclasses of FileInputFormat can also override the
+ {@link #isSplitable(FileSystem, Path)} method to ensure input-files are
+ not split-up and are processed as a whole by {@link Mapper}s.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the job output should be compressed,
+ false otherwise]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Tasks' Side-Effect Files
+
+
Note: The following is valid only if the {@link OutputCommitter}
+ is {@link FileOutputCommitter}. If OutputCommitter is not
+ a FileOutputCommitter, the task's temporary output
+ directory is same as {@link #getOutputPath(JobConf)} i.e.
+ ${mapred.output.dir}$
+
+
Some applications need to create/write-to side-files, which differ from
+ the actual job-outputs.
+
+
In such cases there could be issues with 2 instances of the same TIP
+ (running simultaneously e.g. speculative tasks) trying to open/write-to the
+ same file (path) on HDFS. Hence the application-writer will have to pick
+ unique names per task-attempt (e.g. using the attemptid, say
+ attempt_200709221812_0001_m_000000_0), not just per TIP.
+
+
To get around this the Map-Reduce framework helps the application-writer
+ out by maintaining a special
+ ${mapred.output.dir}/_temporary/_${taskid}
+ sub-directory for each task-attempt on HDFS where the output of the
+ task-attempt goes. On successful completion of the task-attempt the files
+ in the ${mapred.output.dir}/_temporary/_${taskid} (only)
+ are promoted to ${mapred.output.dir}. Of course, the
+ framework discards the sub-directory of unsuccessful task-attempts. This
+ is completely transparent to the application.
+
+
The application-writer can take advantage of this by creating any
+ side-files required in ${mapred.work.output.dir} during execution
+ of his reduce-task i.e. via {@link #getWorkOutputPath(JobConf)}, and the
+ framework will move them out similarly - thus she doesn't have to pick
+ unique paths per task-attempt.
+
+
Note: the value of ${mapred.work.output.dir} during
+ execution of a particular task-attempt is actually
+ ${mapred.output.dir}/_temporary/_{$taskid}, and this value is
+ set by the map-reduce framework. So, just create any side-files in the
+ path returned by {@link #getWorkOutputPath(JobConf)} from map/reduce
+ task to take advantage of this feature.
+
+
The entire discussion holds true for maps of jobs with
+ reducer=NONE (i.e. 0 reduces) since output of the map, in that case,
+ goes directly to HDFS.
+
+ @return the {@link Path} to the task's temporary output directory
+ for the map-reduce job.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The generated name can be used to create custom files from within the
+ different tasks for the job, the names for different tasks will not collide
+ with each other.
+
+
The given name is postfixed with the task type, 'm' for maps, 'r' for
+ reduces and the task partition number. For example, give a name 'test'
+ running on the first map o the job the generated name will be
+ 'test-m-00000'.
+
+ @param conf the configuration for the job.
+ @param name the name to make unique.
+ @return a unique name accross all tasks of the job.]]>
+
+
+
+
+
+
+ The path can be used to create custom files from within the map and
+ reduce tasks. The path name will be unique for each task. The path parent
+ will be the job output directory.ls
+
+
This method uses the {@link #getUniqueName} method to make the file name
+ unique for the task.
+
+ @param conf the configuration for the job.
+ @param name the name for the file.
+ @return a unique path accross all tasks of the job.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Each {@link InputSplit} is then assigned to an individual {@link Mapper}
+ for processing.
+
+
Note: The split is a logical split of the inputs and the
+ input files are not physically split into chunks. For e.g. a split could
+ be <input-file-path, start, offset> tuple.
+
+ @param job job configuration.
+ @param numSplits the desired number of splits, a hint.
+ @return an array of {@link InputSplit}s for the job.]]>
+
+
+
+
+
+
+
+
+ It is the responsibility of the RecordReader to respect
+ record boundaries while processing the logical split to present a
+ record-oriented view to the individual task.
+
+ @param split the {@link InputSplit}
+ @param job the job that this split belongs to
+ @return a {@link RecordReader}]]>
+
+
+
+ InputFormat describes the input-specification for a
+ Map-Reduce job.
+
+
The Map-Reduce framework relies on the InputFormat of the
+ job to:
+
+
+ Validate the input-specification of the job.
+
+ Split-up the input file(s) into logical {@link InputSplit}s, each of
+ which is then assigned to an individual {@link Mapper}.
+
+
+ Provide the {@link RecordReader} implementation to be used to glean
+ input records from the logical InputSplit for processing by
+ the {@link Mapper}.
+
+
+
+
The default behavior of file-based {@link InputFormat}s, typically
+ sub-classes of {@link FileInputFormat}, is to split the
+ input into logical {@link InputSplit}s based on the total size, in
+ bytes, of the input files. However, the {@link FileSystem} blocksize of
+ the input files is treated as an upper bound for input splits. A lower bound
+ on the split size can be set via
+
+ mapred.min.split.size.
+
+
Clearly, logical splits based on input-size is insufficient for many
+ applications since record boundaries are to respected. In such cases, the
+ application has to also implement a {@link RecordReader} on whom lies the
+ responsibilty to respect record-boundaries and present a record-oriented
+ view of the logical InputSplit to the individual task.
+
+ @see InputSplit
+ @see RecordReader
+ @see JobClient
+ @see FileInputFormat]]>
+
+
+
+
+
+
+
+
+
+ InputSplit.
+
+ @return the number of bytes in the input split.
+ @throws IOException]]>
+
+
+
+
+
+ InputSplit is
+ located as an array of Strings.
+ @throws IOException]]>
+
+
+
+ InputSplit represents the data to be processed by an
+ individual {@link Mapper}.
+
+
Typically, it presents a byte-oriented view on the input and is the
+ responsibility of {@link RecordReader} of the job to process this and present
+ a record-oriented view.
+
+ @see InputFormat
+ @see RecordReader]]>
+
+ Checking the input and output specifications of the job.
+
+
+ Computing the {@link InputSplit}s for the job.
+
+
+ Setup the requisite accounting information for the {@link DistributedCache}
+ of the job, if necessary.
+
+
+ Copying the job's jar and configuration to the map-reduce system directory
+ on the distributed file-system.
+
+
+ Submitting the job to the JobTracker and optionally monitoring
+ it's status.
+
+
+
+ Normally the user creates the application, describes various facets of the
+ job via {@link JobConf} and then uses the JobClient to submit
+ the job and monitor its progress.
+
+
Here is an example on how to use JobClient:
+
+ // Create a new JobConf
+ JobConf job = new JobConf(new Configuration(), MyJob.class);
+
+ // Specify various job-specific parameters
+ job.setJobName("myjob");
+
+ job.setInputPath(new Path("in"));
+ job.setOutputPath(new Path("out"));
+
+ job.setMapperClass(MyJob.MyMapper.class);
+ job.setReducerClass(MyJob.MyReducer.class);
+
+ // Submit the job, then poll for progress until the job is complete
+ JobClient.runJob(job);
+
+
+
Job Control
+
+
At times clients would chain map-reduce jobs to accomplish complex tasks
+ which cannot be done via a single map-reduce job. This is fairly easy since
+ the output of the job, typically, goes to distributed file-system and that
+ can be used as the input for the next job.
+
+
However, this also means that the onus on ensuring jobs are complete
+ (success/failure) lies squarely on the clients. In such situations the
+ various job-control options are:
+
+
+ {@link #runJob(JobConf)} : submits the job and returns only after
+ the job has completed.
+
+
+ {@link #submitJob(JobConf)} : only submits the job, then poll the
+ returned handle to the {@link RunningJob} to query status and make
+ scheduling decisions.
+
+
+ {@link JobConf#setJobEndNotificationURI(String)} : setup a notification
+ on job-completion, thus avoiding polling.
+
+
+
+ @see JobConf
+ @see ClusterStatus
+ @see Tool
+ @see DistributedCache]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ If the parameter {@code loadDefaults} is false, the new instance
+ will not load resources from the default files.
+
+ @param loadDefaults specifies whether to load from the default files]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if framework should keep the intermediate files
+ for failed tasks, false otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the outputs of the maps are to be compressed,
+ false otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This comparator should be provided if the equivalence rules for keys
+ for sorting the intermediates are different from those for grouping keys
+ before each call to
+ {@link Reducer#reduce(Object, java.util.Iterator, OutputCollector, Reporter)}.
+
+
For key-value pairs (K1,V1) and (K2,V2), the values (V1, V2) are passed
+ in a single call to the reduce function if K1 and K2 compare as equal.
+
+
Since {@link #setOutputKeyComparatorClass(Class)} can be used to control
+ how keys are sorted, this can be used in conjunction to simulate
+ secondary sort on values.
+
+
Note: This is not a guarantee of the reduce sort being
+ stable in any sense. (In any case, with the order of available
+ map-outputs to the reduce being non-deterministic, it wouldn't make
+ that much sense.)
+
+ @param theClass the comparator class to be used for grouping keys.
+ It should implement RawComparator.
+ @see #setOutputKeyComparatorClass(Class)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ combiner class used to combine map-outputs
+ before being sent to the reducers. Typically the combiner is same as the
+ the {@link Reducer} for the job i.e. {@link #getReducerClass()}.
+
+ @return the user-defined combiner class used to combine map-outputs.]]>
+
+
+
+
+
+ combiner class used to combine map-outputs
+ before being sent to the reducers.
+
+
The combiner is a task-level aggregation operation which, in some cases,
+ helps to cut down the amount of data transferred from the {@link Mapper} to
+ the {@link Reducer}, leading to better performance.
+
+
Typically the combiner is same as the Reducer for the
+ job i.e. {@link #setReducerClass(Class)}.
+
+ @param theClass the user-defined combiner class used to combine
+ map-outputs.]]>
+
+
+
+
+ true.
+
+ @return true if speculative execution be used for this job,
+ false otherwise.]]>
+
+
+
+
+
+ true if speculative execution
+ should be turned on, else false.]]>
+
+
+
+
+ true.
+
+ @return true if speculative execution be
+ used for this job for map tasks,
+ false otherwise.]]>
+
+
+
+
+
+ true if speculative execution
+ should be turned on for map tasks,
+ else false.]]>
+
+
+
+
+ true.
+
+ @return true if speculative execution be used
+ for reduce tasks for this job,
+ false otherwise.]]>
+
+
+
+
+
+ true if speculative execution
+ should be turned on for reduce tasks,
+ else false.]]>
+
+
+
+
+ 1.
+
+ @return the number of reduce tasks for this job.]]>
+
+
+
+
+
+ Note: This is only a hint to the framework. The actual
+ number of spawned map tasks depends on the number of {@link InputSplit}s
+ generated by the job's {@link InputFormat#getSplits(JobConf, int)}.
+
+ A custom {@link InputFormat} is typically used to accurately control
+ the number of map tasks for the job.
+
+
How many maps?
+
+
The number of maps is usually driven by the total size of the inputs
+ i.e. total number of blocks of the input files.
+
+
The right level of parallelism for maps seems to be around 10-100 maps
+ per-node, although it has been set up to 300 or so for very cpu-light map
+ tasks. Task setup takes awhile, so it is best if the maps take at least a
+ minute to execute.
+
+
The default behavior of file-based {@link InputFormat}s is to split the
+ input into logical {@link InputSplit}s based on the total size, in
+ bytes, of input files. However, the {@link FileSystem} blocksize of the
+ input files is treated as an upper bound for input splits. A lower bound
+ on the split size can be set via
+
+ mapred.min.split.size.
+
+
Thus, if you expect 10TB of input data and have a blocksize of 128MB,
+ you'll end up with 82,000 maps, unless {@link #setNumMapTasks(int)} is
+ used to set it even higher.
+
+ @param n the number of map tasks for this job.
+ @see InputFormat#getSplits(JobConf, int)
+ @see FileInputFormat
+ @see FileSystem#getDefaultBlockSize()
+ @see FileStatus#getBlockSize()]]>
+
+
+
+
+ 1.
+
+ @return the number of reduce tasks for this job.]]>
+
+
+
+
+
+ How many reduces?
+
+
With 0.95 all of the reduces can launch immediately and
+ start transfering map outputs as the maps finish. With 1.75
+ the faster nodes will finish their first round of reduces and launch a
+ second wave of reduces doing a much better job of load balancing.
+
+
Increasing the number of reduces increases the framework overhead, but
+ increases load balancing and lowers the cost of failures.
+
+
The scaling factors above are slightly less than whole numbers to
+ reserve a few reduce slots in the framework for speculative-tasks, failures
+ etc.
+
+
Reducer NONE
+
+
It is legal to set the number of reduce-tasks to zero.
+
+
In this case the output of the map-tasks directly go to distributed
+ file-system, to the path set by
+ {@link FileOutputFormat#setOutputPath(JobConf, Path)}. Also, the
+ framework doesn't sort the map-outputs before writing it out to HDFS.
+
+ @param n the number of reduce tasks for this job.]]>
+
+
+
+
+ mapred.map.max.attempts
+ property. If this property is not already set, the default is 4 attempts.
+
+ @return the max number of attempts per map task.]]>
+
+
+
+
+
+
+
+
+
+
+ mapred.reduce.max.attempts
+ property. If this property is not already set, the default is 4 attempts.
+
+ @return the max number of attempts per reduce task.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ noFailures, the
+ tasktracker is blacklisted for this job.
+
+ @param noFailures maximum no. of failures of a given job per tasktracker.]]>
+
+
+
+
+ blacklisted for this job.
+
+ @return the maximum no. of failures of a given job per tasktracker.]]>
+
+
+
+
+ failed.
+
+ Defaults to zero, i.e. any failed map-task results in
+ the job being declared as {@link JobStatus#FAILED}.
+
+ @return the maximum percentage of map tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+
+ failed.
+
+ @param percent the maximum percentage of map tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+ failed.
+
+ Defaults to zero, i.e. any failed reduce-task results
+ in the job being declared as {@link JobStatus#FAILED}.
+
+ @return the maximum percentage of reduce tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+
+ failed.
+
+ @param percent the maximum percentage of reduce tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The debug script can aid debugging of failed map tasks. The script is
+ given task's stdout, stderr, syslog, jobconf files as arguments.
+
+
The debug command, run on the node where the map failed, is:
+
+ $script $stdout $stderr $syslog $jobconf.
+
+
+
The script file is distributed through {@link DistributedCache}
+ APIs. The script needs to be symlinked.
+
+ @param mDbgScript the script name]]>
+
+
+
+
+
+
+
+
+
+
+ The debug script can aid debugging of failed reduce tasks. The script
+ is given task's stdout, stderr, syslog, jobconf files as arguments.
+
+
The debug command, run on the node where the map failed, is:
+
+ $script $stdout $stderr $syslog $jobconf.
+
+
+
The script file is distributed through {@link DistributedCache}
+ APIs. The script file needs to be symlinked
+
+ @param rDbgScript the script name]]>
+
+
+
+
+
+
+
+
+
+ null if it hasn't
+ been set.
+ @see #setJobEndNotificationURI(String)]]>
+
+
+
+
+
+ The uri can contain 2 special parameters: $jobId and
+ $jobStatus. Those, if present, are replaced by the job's
+ identifier and completion-status respectively.
+
+
This is typically used by application-writers to implement chaining of
+ Map-Reduce jobs in an asynchronous manner.
+
+ @param uri the job end notification uri
+ @see JobStatus
+ @see Job Completion and Chaining]]>
+
+
+
+
+
+ When a job starts, a shared directory is created at location
+
+ ${mapred.local.dir}/taskTracker/jobcache/$jobid/work/ .
+ This directory is exposed to the users through
+ job.local.dir .
+ So, the tasks can use this space
+ as scratch space and share files among them.
+ This value is available as System property also.
+
+ @return The localized job specific shared directory]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ JobConf is the primary interface for a user to describe a
+ map-reduce job to the Hadoop framework for execution. The framework tries to
+ faithfully execute the job as-is described by JobConf, however:
+
+
+ Some configuration parameters might have been marked as
+
+ final by administrators and hence cannot be altered.
+
+
+ While some job parameters are straight-forward to set
+ (e.g. {@link #setNumReduceTasks(int)}), some parameters interact subtly
+ rest of the framework and/or job-configuration and is relatively more
+ complex for the user to control finely (e.g. {@link #setNumMapTasks(int)}).
+
+
+
+
JobConf typically specifies the {@link Mapper}, combiner
+ (if any), {@link Partitioner}, {@link Reducer}, {@link InputFormat} and
+ {@link OutputFormat} implementations to be used etc.
+
+
Optionally JobConf is used to specify other advanced facets
+ of the job such as Comparators to be used, files to be put in
+ the {@link DistributedCache}, whether or not intermediate and/or job outputs
+ are to be compressed (and how), debugability via user-provided scripts
+ ( {@link #setMapDebugScript(String)}/{@link #setReduceDebugScript(String)}),
+ for doing post-processing on task logs, task's stdout, stderr, syslog.
+ and etc.
+
+
Here is an example on how to configure a job via JobConf:
+
+ // Create a new JobConf
+ JobConf job = new JobConf(new Configuration(), MyJob.class);
+
+ // Specify various job-specific parameters
+ job.setJobName("myjob");
+
+ FileInputFormat.setInputPaths(job, new Path("in"));
+ FileOutputFormat.setOutputPath(job, new Path("out"));
+
+ job.setMapperClass(MyJob.MyMapper.class);
+ job.setCombinerClass(MyJob.MyReducer.class);
+ job.setReducerClass(MyJob.MyReducer.class);
+
+ job.setInputFormat(SequenceFileInputFormat.class);
+ job.setOutputFormat(SequenceFileOutputFormat.class);
+
+ @param jtIdentifier jobTracker identifier, or null
+ @param jobId job number, or null
+ @return a regex pattern matching JobIDs]]>
+
+
+
+
+ An example JobID is :
+ job_200707121733_0003 , which represents the third job
+ running at the jobtracker started at 200707121733.
+
+ Applications should never construct or parse JobID strings, but rather
+ use appropriate constructors or {@link #forName(String)} method.
+
+ @see TaskID
+ @see TaskAttemptID
+ @see JobTracker#getNewJobId()
+ @see JobTracker#getStartTime()]]>
+
Applications can use the {@link Reporter} provided to report progress
+ or just indicate that they are alive. In scenarios where the application
+ takes an insignificant amount of time to process individual key/value
+ pairs, this is crucial since the framework might assume that the task has
+ timed-out and kill that task. The other way of avoiding this is to set
+
+ mapred.task.timeout to a high-enough value (or even zero for no
+ time-outs).
+
+ @param key the input key.
+ @param value the input value.
+ @param output collects mapped keys and values.
+ @param reporter facility to report progress.]]>
+
+
+
+ Maps are the individual tasks which transform input records into a
+ intermediate records. The transformed intermediate records need not be of
+ the same type as the input records. A given input pair may map to zero or
+ many output pairs.
+
+
The Hadoop Map-Reduce framework spawns one map task for each
+ {@link InputSplit} generated by the {@link InputFormat} for the job.
+ Mapper implementations can access the {@link JobConf} for the
+ job via the {@link JobConfigurable#configure(JobConf)} and initialize
+ themselves. Similarly they can use the {@link Closeable#close()} method for
+ de-initialization.
+
+
The framework then calls
+ {@link #map(Object, Object, OutputCollector, Reporter)}
+ for each key/value pair in the InputSplit for that task.
+
+
All intermediate values associated with a given output key are
+ subsequently grouped by the framework, and passed to a {@link Reducer} to
+ determine the final output. Users can control the grouping by specifying
+ a Comparator via
+ {@link JobConf#setOutputKeyComparatorClass(Class)}.
+
+
The grouped Mapper outputs are partitioned per
+ Reducer. Users can control which keys (and hence records) go to
+ which Reducer by implementing a custom {@link Partitioner}.
+
+
Users can optionally specify a combiner, via
+ {@link JobConf#setCombinerClass(Class)}, to perform local aggregation of the
+ intermediate outputs, which helps to cut down the amount of data transferred
+ from the Mapper to the Reducer.
+
+
The intermediate, grouped outputs are always stored in
+ {@link SequenceFile}s. Applications can specify if and how the intermediate
+ outputs are to be compressed and which {@link CompressionCodec}s are to be
+ used via the JobConf.
+
+
If the job has
+ zero
+ reduces then the output of the Mapper is directly written
+ to the {@link FileSystem} without grouping by keys.
+
+
Example:
+
+ public class MyMapper<K extends WritableComparable, V extends Writable>
+ extends MapReduceBase implements Mapper<K, V, K, V> {
+
+ static enum MyCounters { NUM_RECORDS }
+
+ private String mapTaskId;
+ private String inputFile;
+ private int noRecords = 0;
+
+ public void configure(JobConf job) {
+ mapTaskId = job.get("mapred.task.id");
+ inputFile = job.get("mapred.input.file");
+ }
+
+ public void map(K key, V val,
+ OutputCollector<K, V> output, Reporter reporter)
+ throws IOException {
+ // Process the <key, value> pair (assume this takes a while)
+ // ...
+ // ...
+
+ // Let the framework know that we are alive, and kicking!
+ // reporter.progress();
+
+ // Process some more
+ // ...
+ // ...
+
+ // Increment the no. of <key, value> pairs processed
+ ++noRecords;
+
+ // Increment counters
+ reporter.incrCounter(NUM_RECORDS, 1);
+
+ // Every 100 records update application-level status
+ if ((noRecords%100) == 0) {
+ reporter.setStatus(mapTaskId + " processed " + noRecords +
+ " from input-file: " + inputFile);
+ }
+
+ // Output the result
+ output.collect(key, val);
+ }
+ }
+
+
+
Applications may write a custom {@link MapRunnable} to exert greater
+ control on map processing e.g. multi-threaded Mappers etc.
Mapping of input records to output records is complete when this method
+ returns.
+
+ @param input the {@link RecordReader} to read the input records.
+ @param output the {@link OutputCollector} to collect the outputrecords.
+ @param reporter {@link Reporter} to report progress, status-updates etc.
+ @throws IOException]]>
+
+
+
+ Custom implementations of MapRunnable can exert greater
+ control on map processing e.g. multi-threaded, asynchronous mappers etc.
+
+ @see Mapper]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ nearly
+ equal content length.
+ Subclasses implement {@link #getRecordReader(InputSplit, JobConf, Reporter)}
+ to construct RecordReader's for MultiFileSplit's.
+ @see MultiFileSplit]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ th Path]]>
+
+
+
+
+
+
+
+
+
+
+ th Path]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ MultiFileSplit can be used to implement {@link RecordReader}'s, with
+ reading one record per file.
+ @see FileSplit
+ @see MultiFileInputFormat]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ <key, value> pairs output by {@link Mapper}s
+ and {@link Reducer}s.
+
+
OutputCollector is the generalization of the facility
+ provided by the Map-Reduce framework to collect data output by either the
+ Mapper or the Reducer i.e. intermediate outputs
+ or the output of the job.
The Map-Reduce framework relies on the OutputCommitter of
+ the job to:
+
+
+ Setup the job during initialization. For example, create the temporary
+ output directory for the job during the initialization of the job.
+
+
+ Cleanup the job after the job completion. For example, remove the
+ temporary output directory after the job completion.
+
+
+ Setup the task temporary output.
+
+
+ Check whether a task needs a commit. This is to avoid the commit
+ procedure if a task does not need commit.
+
+
+ Commit of the task output.
+
+
+ Discard the task commit.
+
+
+
+ @see FileOutputCommitter
+ @see JobContext
+ @see TaskAttemptContext]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This is to validate the output specification for the job when it is
+ a job is submitted. Typically checks that it does not already exist,
+ throwing an exception when it already exists, so that output is not
+ overwritten.
+
+ @param ignored
+ @param job job configuration.
+ @throws IOException when output should not be attempted]]>
+
+
+
+ OutputFormat describes the output-specification for a
+ Map-Reduce job.
+
+
The Map-Reduce framework relies on the OutputFormat of the
+ job to:
+
+
+ Validate the output-specification of the job. For e.g. check that the
+ output directory doesn't already exist.
+
+ Provide the {@link RecordWriter} implementation to be used to write out
+ the output files of the job. Output files are stored in a
+ {@link FileSystem}.
+
+
+
+ @see RecordWriter
+ @see JobConf]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Typically a hash function on a all or a subset of the key.
+
+ @param key the key to be paritioned.
+ @param value the entry value.
+ @param numPartitions the total number of partitions.
+ @return the partition number for the key.]]>
+
+
+
+ Partitioner controls the partitioning of the keys of the
+ intermediate map-outputs. The key (or a subset of the key) is used to derive
+ the partition, typically by a hash function. The total number of partitions
+ is the same as the number of reduce tasks for the job. Hence this controls
+ which of the m reduce tasks the intermediate key (and hence the
+ record) is sent for reduction.
+
+ @see Reducer]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ 0.0 to 1.0.
+ @throws IOException]]>
+
+
+
+ RecordReader reads <key, value> pairs from an
+ {@link InputSplit}.
+
+
RecordReader, typically, converts the byte-oriented view of
+ the input, provided by the InputSplit, and presents a
+ record-oriented view for the {@link Mapper} & {@link Reducer} tasks for
+ processing. It thus assumes the responsibility of processing record
+ boundaries and presenting the tasks with keys and values.
RecordWriter implementations write the job outputs to the
+ {@link FileSystem}.
+
+ @see OutputFormat]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Reduces values for a given key.
+
+
The framework calls this method for each
+ <key, (list of values)> pair in the grouped inputs.
+ Output values must be of the same type as input values. Input keys must
+ not be altered. The framework will reuse the key and value objects
+ that are passed into the reduce, therefore the application should clone
+ the objects they want to keep a copy of. In many cases, all values are
+ combined into zero or one value.
+
+
+
Output pairs are collected with calls to
+ {@link OutputCollector#collect(Object,Object)}.
+
+
Applications can use the {@link Reporter} provided to report progress
+ or just indicate that they are alive. In scenarios where the application
+ takes an insignificant amount of time to process individual key/value
+ pairs, this is crucial since the framework might assume that the task has
+ timed-out and kill that task. The other way of avoiding this is to set
+
+ mapred.task.timeout to a high-enough value (or even zero for no
+ time-outs).
+
+ @param key the key.
+ @param values the list of values to reduce.
+ @param output to collect keys and combined values.
+ @param reporter facility to report progress.]]>
+
+
+
+ The number of Reducers for the job is set by the user via
+ {@link JobConf#setNumReduceTasks(int)}. Reducer implementations
+ can access the {@link JobConf} for the job via the
+ {@link JobConfigurable#configure(JobConf)} method and initialize themselves.
+ Similarly they can use the {@link Closeable#close()} method for
+ de-initialization.
+
+
Reducer has 3 primary phases:
+
+
+
+
Shuffle
+
+
Reducer is input the grouped output of a {@link Mapper}.
+ In the phase the framework, for each Reducer, fetches the
+ relevant partition of the output of all the Mappers, via HTTP.
+
+
+
+
+
Sort
+
+
The framework groups Reducer inputs by keys
+ (since different Mappers may have output the same key) in this
+ stage.
+
+
The shuffle and sort phases occur simultaneously i.e. while outputs are
+ being fetched they are merged.
+
+
SecondarySort
+
+
If equivalence rules for keys while grouping the intermediates are
+ different from those for grouping keys before reduction, then one may
+ specify a Comparator via
+ {@link JobConf#setOutputValueGroupingComparator(Class)}.Since
+ {@link JobConf#setOutputKeyComparatorClass(Class)} can be used to
+ control how intermediate keys are grouped, these can be used in conjunction
+ to simulate secondary sort on values.
+
+
+ For example, say that you want to find duplicate web pages and tag them
+ all with the url of the "best" known example. You would set up the job
+ like:
+
+
Map Input Key: url
+
Map Input Value: document
+
Map Output Key: document checksum, url pagerank
+
Map Output Value: url
+
Partitioner: by checksum
+
OutputKeyComparator: by checksum and then decreasing pagerank
+
OutputValueGroupingComparator: by checksum
+
+
+
+
+
Reduce
+
+
In this phase the
+ {@link #reduce(Object, Iterator, OutputCollector, Reporter)}
+ method is called for each <key, (list of values)> pair in
+ the grouped inputs.
+
The output of the reduce task is typically written to the
+ {@link FileSystem} via
+ {@link OutputCollector#collect(Object, Object)}.
+
+
+
+
The output of the Reducer is not re-sorted.
+
+
Example:
+
+ public class MyReducer<K extends WritableComparable, V extends Writable>
+ extends MapReduceBase implements Reducer<K, V, K, V> {
+
+ static enum MyCounters { NUM_RECORDS }
+
+ private String reduceTaskId;
+ private int noKeys = 0;
+
+ public void configure(JobConf job) {
+ reduceTaskId = job.get("mapred.task.id");
+ }
+
+ public void reduce(K key, Iterator<V> values,
+ OutputCollector<K, V> output,
+ Reporter reporter)
+ throws IOException {
+
+ // Process
+ int noValues = 0;
+ while (values.hasNext()) {
+ V value = values.next();
+
+ // Increment the no. of values for this key
+ ++noValues;
+
+ // Process the <key, value> pair (assume this takes a while)
+ // ...
+ // ...
+
+ // Let the framework know that we are alive, and kicking!
+ if ((noValues%10) == 0) {
+ reporter.progress();
+ }
+
+ // Process some more
+ // ...
+ // ...
+
+ // Output the <key, value>
+ output.collect(key, value);
+ }
+
+ // Increment the no. of <key, list of values> pairs processed
+ ++noKeys;
+
+ // Increment counters
+ reporter.incrCounter(NUM_RECORDS, 1);
+
+ // Every 100 keys update application-level status
+ if ((noKeys%100) == 0) {
+ reporter.setStatus(reduceTaskId + " processed " + noKeys);
+ }
+ }
+ }
+
+
+ @see Mapper
+ @see Partitioner
+ @see Reporter
+ @see MapReduceBase]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Counter of the given group/name.]]>
+
+
+
+
+
+
+ Enum.
+ @param amount A non-negative amount by which the counter is to
+ be incremented.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ InputSplit that the map is reading from.
+ @throws UnsupportedOperationException if called outside a mapper]]>
+
+
+
+
+
+
+
+
+ {@link Mapper} and {@link Reducer} can use the Reporter
+ provided to report progress or just indicate that they are alive. In
+ scenarios where the application takes an insignificant amount of time to
+ process individual key/value pairs, this is crucial since the framework
+ might assume that the task has timed-out and kill that task.
+
+
Applications can also update {@link Counters} via the provided
+ Reporter .
+
+ @see Progressable
+ @see Counters]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ progress of the job's map-tasks, as a float between 0.0
+ and 1.0. When all map tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's map-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ progress of the job's reduce-tasks, as a float between 0.0
+ and 1.0. When all reduce tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's reduce-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ progress of the job's cleanup-tasks, as a float between 0.0
+ and 1.0. When all cleanup tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's cleanup-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ progress of the job's setup-tasks, as a float between 0.0
+ and 1.0. When all setup tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's setup-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ true if the job is complete, else false.
+ @throws IOException]]>
+
+
+
+
+
+ true if the job succeeded, else false.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ RunningJob is the user-interface to query for details on a
+ running Map-Reduce job.
+
+
Clients can get hold of RunningJob via the {@link JobClient}
+ and then query the running-job for details such as name, configuration,
+ progress etc.
+
+ @see JobClient]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This allows the user to specify the key class to be different
+ from the actual class ({@link BytesWritable}) used for writing
+
+ @param conf the {@link JobConf} to modify
+ @param theClass the SequenceFile output key class.]]>
+
+
+
+
+
+
+ This allows the user to specify the value class to be different
+ from the actual class ({@link BytesWritable}) used for writing
+
+ @param conf the {@link JobConf} to modify
+ @param theClass the SequenceFile output key class.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ f. The filtering criteria is
+ MD5(key) % f == 0.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ f using
+ the criteria record# % f == 0.
+ For example, if the frequency is 10, one out of 10 records is returned.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if auto increment
+ {@link SkipBadRecords#COUNTER_MAP_PROCESSED_RECORDS}.
+ false otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ true if auto increment
+ {@link SkipBadRecords#COUNTER_REDUCE_PROCESSED_GROUPS}.
+ false otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Hadoop provides an optional mode of execution in which the bad records
+ are detected and skipped in further attempts.
+
+
This feature can be used when map/reduce tasks crashes deterministically on
+ certain input. This happens due to bugs in the map/reduce function. The usual
+ course would be to fix these bugs. But sometimes this is not possible;
+ perhaps the bug is in third party libraries for which the source code is
+ not available. Due to this, the task never reaches to completion even with
+ multiple attempts and complete data for that task is lost.
+
+
With this feature, only a small portion of data is lost surrounding
+ the bad record, which may be acceptable for some user applications.
+ see {@link SkipBadRecords#setMapperMaxSkipRecords(Configuration, long)}
+
+
The skipping mode gets kicked off after certain no of failures
+ see {@link SkipBadRecords#setAttemptsToStartSkipping(Configuration, int)}
+
+
In the skipping mode, the map/reduce task maintains the record range which
+ is getting processed at all times. Before giving the input to the
+ map/reduce function, it sends this record range to the Task tracker.
+ If task crashes, the Task tracker knows which one was the last reported
+ range. On further attempts that range get skipped.
+ @param jtIdentifier jobTracker identifier, or null
+ @param jobId job number, or null
+ @param isMap whether the tip is a map, or null
+ @param taskId taskId number, or null
+ @param attemptId the task attempt number, or null
+ @return a regex pattern matching TaskAttemptIDs]]>
+
+
+
+
+ An example TaskAttemptID is :
+ attempt_200707121733_0003_m_000005_0 , which represents the
+ zeroth task attempt for the fifth map task in the third job
+ running at the jobtracker started at 200707121733.
+
+ Applications should never construct or parse TaskAttemptID strings
+ , but rather use appropriate constructors or {@link #forName(String)}
+ method.
+
+ @see JobID
+ @see TaskID]]>
+
+ @param jtIdentifier jobTracker identifier, or null
+ @param jobId job number, or null
+ @param isMap whether the tip is a map, or null
+ @param taskId taskId number, or null
+ @return a regex pattern matching TaskIDs]]>
+
+
+
+
+ An example TaskID is :
+ task_200707121733_0003_m_000005 , which represents the
+ fifth map task in the third job running at the jobtracker
+ started at 200707121733.
+
+ Applications should never construct or parse TaskID strings
+ , but rather use appropriate constructors or {@link #forName(String)}
+ method.
+
+ @see JobID
+ @see TaskAttemptID]]>
+
+
+
+
+
+
+
+ (tbl(,),tbl(,),...,tbl(,)) }]]>
+
+
+
+
+
+
+
+ (tbl(,),tbl(,),...,tbl(,)) }]]>
+
+
+
+ mapred.join.define.<ident> to a classname. In the expression
+ mapred.join.expr, the identifier will be assumed to be a
+ ComposableRecordReader.
+ mapred.join.keycomparator can be a classname used to compare keys
+ in the join.
+ @see JoinRecordReader
+ @see MultiFilterRecordReader]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ......
+ }]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ capacity children to position
+ id in the parent reader.
+ The id of a root CompositeRecordReader is -1 by convention, but relying
+ on this is not recommended.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ override(S1,S2,S3) will prefer values
+ from S3 over S2, and values from S2 over S1 for all keys
+ emitted from all sources.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ [,,...,]]]>
+
+
+
+
+
+
+ out.
+ TupleWritable format:
+ {@code
+ ......
+ }]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ It has to be specified how key and values are passed from one element of
+ the chain to the next, by value or by reference. If a Mapper leverages the
+ assumed semantics that the key and values are not modified by the collector
+ 'by value' must be used. If the Mapper does not expect this semantics, as
+ an optimization to avoid serialization and deserialization 'by reference'
+ can be used.
+
+ For the added Mapper the configuration given for it,
+ mapperConf, have precedence over the job's JobConf. This
+ precedence is in effect when the task is running.
+
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainMapper, this is done by the addMapper for the last mapper in the chain
+
+
+ @param job job's JobConf to add the Mapper class.
+ @param klass the Mapper class to add.
+ @param inputKeyClass mapper input key class.
+ @param inputValueClass mapper input value class.
+ @param outputKeyClass mapper output key class.
+ @param outputValueClass mapper output value class.
+ @param byValue indicates if key/values should be passed by value
+ to the next Mapper in the chain, if any.
+ @param mapperConf a JobConf with the configuration for the Mapper
+ class. It is recommended to use a JobConf without default values using the
+ JobConf(boolean loadDefaults) constructor with FALSE.]]>
+
+
+
+
+
+
+ If this method is overriden super.configure(...) should be
+ invoked at the beginning of the overwriter method.]]>
+
+
+
+
+
+
+
+
+
+ map(...) methods of the Mappers in the chain.]]>
+
+
+
+
+
+
+ If this method is overriden super.close() should be
+ invoked at the end of the overwriter method.]]>
+
+
+
+
+ The Mapper classes are invoked in a chained (or piped) fashion, the output of
+ the first becomes the input of the second, and so on until the last Mapper,
+ the output of the last Mapper will be written to the task's output.
+
+ The key functionality of this feature is that the Mappers in the chain do not
+ need to be aware that they are executed in a chain. This enables having
+ reusable specialized Mappers that can be combined to perform composite
+ operations within a single task.
+
+ Special care has to be taken when creating chains that the key/values output
+ by a Mapper are valid for the following Mapper in the chain. It is assumed
+ all Mappers and the Reduce in the chain use maching output and input key and
+ value classes as no conversion is done by the chaining code.
+
+ Using the ChainMapper and the ChainReducer classes is possible to compose
+ Map/Reduce jobs that look like [MAP+ / REDUCE MAP*]. And
+ immediate benefit of this pattern is a dramatic reduction in disk IO.
+
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainMapper, this is done by the addMapper for the last mapper in the chain.
+
+ ChainMapper usage pattern:
+
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ It has to be specified how key and values are passed from one element of
+ the chain to the next, by value or by reference. If a Reducer leverages the
+ assumed semantics that the key and values are not modified by the collector
+ 'by value' must be used. If the Reducer does not expect this semantics, as
+ an optimization to avoid serialization and deserialization 'by reference'
+ can be used.
+
+ For the added Reducer the configuration given for it,
+ reducerConf, have precedence over the job's JobConf. This
+ precedence is in effect when the task is running.
+
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainReducer, this is done by the setReducer or the addMapper for the last
+ element in the chain.
+
+ @param job job's JobConf to add the Reducer class.
+ @param klass the Reducer class to add.
+ @param inputKeyClass reducer input key class.
+ @param inputValueClass reducer input value class.
+ @param outputKeyClass reducer output key class.
+ @param outputValueClass reducer output value class.
+ @param byValue indicates if key/values should be passed by value
+ to the next Mapper in the chain, if any.
+ @param reducerConf a JobConf with the configuration for the Reducer
+ class. It is recommended to use a JobConf without default values using the
+ JobConf(boolean loadDefaults) constructor with FALSE.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ It has to be specified how key and values are passed from one element of
+ the chain to the next, by value or by reference. If a Mapper leverages the
+ assumed semantics that the key and values are not modified by the collector
+ 'by value' must be used. If the Mapper does not expect this semantics, as
+ an optimization to avoid serialization and deserialization 'by reference'
+ can be used.
+
+ For the added Mapper the configuration given for it,
+ mapperConf, have precedence over the job's JobConf. This
+ precedence is in effect when the task is running.
+
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainMapper, this is done by the addMapper for the last mapper in the chain
+ .
+
+ @param job chain job's JobConf to add the Mapper class.
+ @param klass the Mapper class to add.
+ @param inputKeyClass mapper input key class.
+ @param inputValueClass mapper input value class.
+ @param outputKeyClass mapper output key class.
+ @param outputValueClass mapper output value class.
+ @param byValue indicates if key/values should be passed by value
+ to the next Mapper in the chain, if any.
+ @param mapperConf a JobConf with the configuration for the Mapper
+ class. It is recommended to use a JobConf without default values using the
+ JobConf(boolean loadDefaults) constructor with FALSE.]]>
+
+
+
+
+
+
+ If this method is overriden super.configure(...) should be
+ invoked at the beginning of the overwriter method.]]>
+
+
+
+
+
+
+
+
+
+ reduce(...) method of the Reducer with the
+ map(...) methods of the Mappers in the chain.]]>
+
+
+
+
+
+
+ If this method is overriden super.close() should be
+ invoked at the end of the overwriter method.]]>
+
+
+
+
+ For each record output by the Reducer, the Mapper classes are invoked in a
+ chained (or piped) fashion, the output of the first becomes the input of the
+ second, and so on until the last Mapper, the output of the last Mapper will
+ be written to the task's output.
+
+ The key functionality of this feature is that the Mappers in the chain do not
+ need to be aware that they are executed after the Reducer or in a chain.
+ This enables having reusable specialized Mappers that can be combined to
+ perform composite operations within a single task.
+
+ Special care has to be taken when creating chains that the key/values output
+ by a Mapper are valid for the following Mapper in the chain. It is assumed
+ all Mappers and the Reduce in the chain use maching output and input key and
+ value classes as no conversion is done by the chaining code.
+
+ Using the ChainMapper and the ChainReducer classes is possible to compose
+ Map/Reduce jobs that look like [MAP+ / REDUCE MAP*]. And
+ immediate benefit of this pattern is a dramatic reduction in disk IO.
+
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainReducer, this is done by the setReducer or the addMapper for the last
+ element in the chain.
+
+ ChainReducer usage pattern:
+
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ all splits.
+ @param freq The frequency with which records will be emitted.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ all splits.
+ This will read every split at the client, which is very expensive.
+ @param freq Probability with which a key will be chosen.
+ @param numSamples Total number of samples to obtain from all selected
+ splits.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ all splits.
+ Takes the first numSamples / numSplits records from each split.
+ @param numSamples Total number of samples to obtain from all selected
+ splits.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the name output is multi, false
+ if it is single. If the name output is not defined it returns
+ false]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ @param conf job conf to add the named output
+ @param namedOutput named output name, it has to be a word, letters
+ and numbers only, cannot be the word 'part' as
+ that is reserved for the
+ default output.
+ @param outputFormatClass OutputFormat class.
+ @param keyClass key class
+ @param valueClass value class]]>
+
+
+
+
+
+
+
+
+
+
+
+ @param conf job conf to add the named output
+ @param namedOutput named output name, it has to be a word, letters
+ and numbers only, cannot be the word 'part' as
+ that is reserved for the
+ default output.
+ @param outputFormatClass OutputFormat class.
+ @param keyClass key class
+ @param valueClass value class]]>
+
+
+
+
+
+
+
+ By default these counters are disabled.
+
+ MultipleOutputs supports counters, by default the are disabled.
+ The counters group is the {@link MultipleOutputs} class name.
+
+ The names of the counters are the same as the named outputs. For multi
+ named outputs the name of the counter is the concatenation of the named
+ output, and underscore '_' and the multiname.
+
+ @param conf job conf to enableadd the named output.
+ @param enabled indicates if the counters will be enabled or not.]]>
+
+
+
+
+
+
+ By default these counters are disabled.
+
+ MultipleOutputs supports counters, by default the are disabled.
+ The counters group is the {@link MultipleOutputs} class name.
+
+ The names of the counters are the same as the named outputs. For multi
+ named outputs the name of the counter is the concatenation of the named
+ output, and underscore '_' and the multiname.
+
+
+ @param conf job conf to enableadd the named output.
+ @return TRUE if the counters are enabled, FALSE if they are disabled.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ @param namedOutput the named output name
+ @param reporter the reporter
+ @return the output collector for the given named output
+ @throws IOException thrown if output collector could not be created]]>
+
+
+
+
+
+
+
+
+
+
+ @param namedOutput the named output name
+ @param multiName the multi name part
+ @param reporter the reporter
+ @return the output collector for the given named output
+ @throws IOException thrown if output collector could not be created]]>
+
+
+
+
+
+
+ If overriden subclasses must invoke super.close() at the
+ end of their close()
+
+ @throws java.io.IOException thrown if any of the MultipleOutput files
+ could not be closed properly.]]>
+
+
+
+ OutputCollector passed to
+ the map() and reduce() methods of the
+ Mapper and Reducer implementations.
+
+ Each additional output, or named output, may be configured with its own
+ OutputFormat, with its own key class and with its own value
+ class.
+
+ A named output can be a single file or a multi file. The later is refered as
+ a multi named output.
+
+ A multi named output is an unbound set of files all sharing the same
+ OutputFormat, key class and value class configuration.
+
+ When named outputs are used within a Mapper implementation,
+ key/values written to a name output are not part of the reduce phase, only
+ key/values written to the job OutputCollector are part of the
+ reduce phase.
+
+ MultipleOutputs supports counters, by default the are disabled. The counters
+ group is the {@link MultipleOutputs} class name.
+
+ The names of the counters are the same as the named outputs. For multi
+ named outputs the name of the counter is the concatenation of the named
+ output, and underscore '_' and the multiname.
+
+ Job configuration usage pattern is:
+
+
+ JobConf conf = new JobConf();
+
+ conf.setInputPath(inDir);
+ FileOutputFormat.setOutputPath(conf, outDir);
+
+ conf.setMapperClass(MOMap.class);
+ conf.setReducerClass(MOReduce.class);
+ ...
+
+ // Defines additional single text based output 'text' for the job
+ MultipleOutputs.addNamedOutput(conf, "text", TextOutputFormat.class,
+ LongWritable.class, Text.class);
+
+ // Defines additional multi sequencefile based output 'sequence' for the
+ // job
+ MultipleOutputs.addMultiNamedOutput(conf, "seq",
+ SequenceFileOutputFormat.class,
+ LongWritable.class, Text.class);
+ ...
+
+ JobClient jc = new JobClient();
+ RunningJob job = jc.submitJob(conf);
+
+ ...
+
+
+ Job configuration usage pattern is:
+
+
+ public class MOReduce implements
+ Reducer<WritableComparable, Writable> {
+ private MultipleOutputs mos;
+
+ public void configure(JobConf conf) {
+ ...
+ mos = new MultipleOutputs(conf);
+ }
+
+ public void reduce(WritableComparable key, Iterator<Writable> values,
+ OutputCollector output, Reporter reporter)
+ throws IOException {
+ ...
+ mos.getCollector("text", reporter).collect(key, new Text("Hello"));
+ mos.getCollector("seq", "A", reporter).collect(key, new Text("Bye"));
+ mos.getCollector("seq", "B", reporter).collect(key, new Text("Chau"));
+ ...
+ }
+
+ public void close() throws IOException {
+ mos.close();
+ ...
+ }
+
+ }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ It can be used instead of the default implementation,
+ @link org.apache.hadoop.mapred.MapRunner, when the Map operation is not CPU
+ bound in order to improve throughput.
+
+ Map implementations using this MapRunnable must be thread-safe.
+
+ The Map-Reduce job has to be configured to use this MapRunnable class (using
+ the JobConf.setMapRunnerClass method) and
+ the number of thread the thread-pool can use with the
+ mapred.map.multithreadedrunner.threads property, its default
+ value is 10 threads.
+
+ Alternatively, the properties can be set in the configuration with proper
+ values.
+
+ @see DBConfiguration#configureDB(JobConf, String, String, String, String)
+ @see DBInputFormat#setInput(JobConf, Class, String, String)
+ @see DBInputFormat#setInput(JobConf, Class, String, String, String, String...)
+ @see DBOutputFormat#setOutput(JobConf, String, String...)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ 20070101 AND length > 0)'
+ @param orderBy the fieldNames in the orderBy clause.
+ @param fieldNames The field names in the table
+ @see #setInput(JobConf, Class, String, String)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ DBInputFormat emits LongWritables containing the record number as
+ key and DBWritables as value.
+
+ The SQL query, and input class can be using one of the two
+ setInput methods.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ {@link DBOutputFormat} accepts <key,value> pairs, where
+ key has a type extending DBWritable. Returned {@link RecordWriter}
+ writes only the key to the database with a batch SQL query.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ DBWritable. DBWritable, is similar to {@link Writable}
+ except that the {@link #write(PreparedStatement)} method takes a
+ {@link PreparedStatement}, and {@link #readFields(ResultSet)}
+ takes a {@link ResultSet}.
+
+ Implementations are responsible for writing the fields of the object
+ to PreparedStatement, and reading the fields of the object from the
+ ResultSet.
+
+
Example:
+ If we have the following table in the database :
+
+ CREATE TABLE MyTable (
+ counter INTEGER NOT NULL,
+ timestamp BIGINT NOT NULL,
+ );
+
+ then we can read/write the tuples from/to the table with :
+
+ public class MyWritable implements Writable, DBWritable {
+ // Some data
+ private int counter;
+ private long timestamp;
+
+ //Writable#write() implementation
+ public void write(DataOutput out) throws IOException {
+ out.writeInt(counter);
+ out.writeLong(timestamp);
+ }
+
+ //Writable#readFields() implementation
+ public void readFields(DataInput in) throws IOException {
+ counter = in.readInt();
+ timestamp = in.readLong();
+ }
+
+ public void write(PreparedStatement statement) throws SQLException {
+ statement.setInt(1, counter);
+ statement.setLong(2, timestamp);
+ }
+
+ public void readFields(ResultSet resultSet) throws SQLException {
+ counter = resultSet.getInt(1);
+ timestamp = resultSet.getLong(2);
+ }
+ }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ When constructing the instance, if the factory property
+ contextName.class exists,
+ its value is taken to be the name of the class to instantiate. Otherwise,
+ the default is to create an instance of
+ org.apache.hadoop.metrics.spi.NullContext, which is a
+ dummy "no-op" context which will cause all metric data to be discarded.
+
+ @param contextName the name of the context
+ @return the named MetricsContext]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ When the instance is constructed, this method checks if the file
+ hadoop-metrics.properties exists on the class path. If it
+ exists, it must be in the format defined by java.util.Properties, and all
+ the properties in the file are set as attributes on the newly created
+ ContextFactory instance.
+
+ @return the singleton ContextFactory instance]]>
+
+
+
+ getFactory() method.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ startMonitoring() again after calling
+ this.
+ @see #close()]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ recordName.
+ Throws an exception if the metrics implementation is configured with a fixed
+ set of record names and recordName is not in that set.
+
+ @param recordName the name of the record
+ @throws MetricsException if recordName conflicts with configuration data]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A record name identifies the kind of data to be reported. For example, a
+ program reporting statistics relating to the disks on a computer might use
+ a record name "diskStats".
+
+ A record has zero or more tags. A tag has a name and a value. To
+ continue the example, the "diskStats" record might use a tag named
+ "diskName" to identify a particular disk. Sometimes it is useful to have
+ more than one tag, so there might also be a "diskType" with value "ide" or
+ "scsi" or whatever.
+
+ A record also has zero or more metrics. These are the named
+ values that are to be reported to the metrics system. In the "diskStats"
+ example, possible metric names would be "diskPercentFull", "diskPercentBusy",
+ "kbReadPerSecond", etc.
+
+ The general procedure for using a MetricsRecord is to fill in its tag and
+ metric values, and then call update() to pass the record to the
+ client library.
+ Metric data is not immediately sent to the metrics system
+ each time that update() is called.
+ An internal table is maintained, identified by the record name. This
+ table has columns
+ corresponding to the tag and the metric names, and rows
+ corresponding to each unique set of tag values. An update
+ either modifies an existing row in the table, or adds a new row with a set of
+ tag values that are different from all the other rows. Note that if there
+ are no tags, then there can be at most one row in the table.
+
+ Once a row is added to the table, its data will be sent to the metrics system
+ on every timer period, whether or not it has been updated since the previous
+ timer period. If this is inappropriate, for example if metrics were being
+ reported by some transient object in an application, the remove()
+ method can be used to remove the row and thus stop the data from being
+ sent.
+
+ Note that the update() method is atomic. This means that it is
+ safe for different threads to be updating the same metric. More precisely,
+ it is OK for different threads to call update() on MetricsRecord instances
+ with the same set of tag names and tag values. Different threads should
+ not use the same MetricsRecord instance at the same time.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ MetricsContext.registerUpdater().]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ fileName attribute,
+ if specified. Otherwise the data will be written to standard
+ output.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class is configured by setting ContextFactory attributes which in turn
+ are usually configured through a properties file. All the attributes are
+ prefixed by the contextName. For example, the properties file might contain:
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ contextName.tableName. The returned map consists of
+ those attributes with the contextName and tableName stripped off.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ recordName.
+ Throws an exception if the metrics implementation is configured with a fixed
+ set of record names and recordName is not in that set.
+
+ @param recordName the name of the record
+ @throws MetricsException if recordName conflicts with configuration data]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class implements the internal table of metric data, and the timer
+ on which data is to be sent to the metrics system. Subclasses must
+ override the abstract emitRecord method in order to transmit
+ the data. ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ update
+ and remove().]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ hostname or hostname:port. If
+ the specs string is null, defaults to localhost:defaultPort.
+
+ @return a list of InetSocketAddress objects.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ,name="
+ Where the and are the supplied parameters
+
+ @param serviceName
+ @param nameName
+ @param theMbean - the MBean to register
+ @return the named used to register the MBean]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ hadoop.rpc.socket.factory.class.<ClassName>. When no
+ such parameter exists then fall back on the default socket factory as
+ configured by hadoop.rpc.socket.factory.class.default. If
+ this default socket factory is not configured, then fall back on the JVM
+ default socket factory.
+
+ @param conf the configuration
+ @param clazz the class (usually a {@link VersionedProtocol})
+ @return a socket factory]]>
+
+
+
+
+
+ hadoop.rpc.socket.factory.default
+
+ @param conf the configuration
+ @return the default socket factory as specified in the configuration or
+ the JVM default socket factory if the configuration does not
+ contain a default socket factory property.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ :
+ ://:/]]>
+
+
+
+
+
+
+
+ :
+ ://:/]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ From documentation for {@link #getInputStream(Socket, long)}:
+ Returns InputStream for the socket. If the socket has an associated
+ SocketChannel then it returns a
+ {@link SocketInputStream} with the given timeout. If the socket does not
+ have a channel, {@link Socket#getInputStream()} is returned. In the later
+ case, the timeout argument is ignored and the timeout set with
+ {@link Socket#setSoTimeout(int)} applies for reads.
+
+ Any socket created using socket factories returned by {@link #NetUtils},
+ must use this interface instead of {@link Socket#getInputStream()}.
+
+ @see #getInputStream(Socket, long)
+
+ @param socket
+ @return InputStream for reading from the socket.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+ Any socket created using socket factories returned by {@link #NetUtils},
+ must use this interface instead of {@link Socket#getInputStream()}.
+
+ @see Socket#getChannel()
+
+ @param socket
+ @param timeout timeout in milliseconds. This may not always apply. zero
+ for waiting as long as necessary.
+ @return InputStream for reading from the socket.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+ From documentation for {@link #getOutputStream(Socket, long)} :
+ Returns OutputStream for the socket. If the socket has an associated
+ SocketChannel then it returns a
+ {@link SocketOutputStream} with the given timeout. If the socket does not
+ have a channel, {@link Socket#getOutputStream()} is returned. In the later
+ case, the timeout argument is ignored and the write will wait until
+ data is available.
The task requires the file or the nested fileset element to be
+ specified. Optional attributes are language (set the output
+ language, default is "java"),
+ destdir (name of the destination directory for generated java/c++
+ code, default is ".") and failonerror (specifies error handling
+ behavior. default is true).
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ in]]>
+
+
+
+
+
+
+ out.]]>
+
+
+
+
+
+
+
+
+
+ reset is true, then resets the checksum.
+ @return number of bytes written. Will be equal to getChecksumSize();]]>
+
+
+
+
+
+
+
+
+ reset is true, then resets the checksum.
+ @return number of bytes written. Will be equal to getChecksumSize();]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ GenericOptionsParser to parse only the generic Hadoop
+ arguments.
+
+ The array of string arguments other than the generic arguments can be
+ obtained by {@link #getRemainingArgs()}.
+
+ @param conf the Configuration to modify.
+ @param args command-line arguments.]]>
+
+
+
+
+ GenericOptionsParser to parse given options as well
+ as generic Hadoop options.
+
+ The resulting CommandLine object can be obtained by
+ {@link #getCommandLine()}.
+
+ @param conf the configuration to modify
+ @param options options built by the caller
+ @param args User-specified arguments]]>
+
+
+
+
+ Strings containing the un-parsed arguments
+ or empty array if commandLine was not defined.]]>
+
+
+
+
+ CommandLine object
+ to process the parsed arguments.
+
+ Note: If the object is created with
+ {@link #GenericOptionsParser(Configuration, String[])}, then returned
+ object will only contain parsed generic options.
+
+ @return CommandLine representing list of arguments
+ parsed against Options descriptor.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ GenericOptionsParser is a utility to parse command line
+ arguments generic to the Hadoop framework.
+
+ GenericOptionsParser recognizes several standarad command
+ line arguments, enabling applications to easily specify a namenode, a
+ jobtracker, additional configuration resources etc.
+
+
Generic Options
+
+
The supported generic options are:
+
+ -conf <configuration file> specify a configuration file
+ -D <property=value> use value for given property
+ -fs <local|namenode:port> specify a namenode
+ -jt <local|jobtracker:port> specify a job tracker
+ -files <comma separated list of files> specify comma separated
+ files to be copied to the map reduce cluster
+ -libjars <comma separated list of jars> specify comma separated
+ jar files to include in the classpath.
+ -archives <comma separated list of archives> specify comma
+ separated archives to be unarchived on the compute machines.
+
+
Generic command line arguments might modify
+ Configuration objects, given to constructors.
+
+
The functionality is implemented using Commons CLI.
+
+
Examples:
+
+ $ bin/hadoop dfs -fs darwin:8020 -ls /data
+ list /data directory in dfs with namenode darwin:8020
+
+ $ bin/hadoop dfs -D fs.default.name=darwin:8020 -ls /data
+ list /data directory in dfs with namenode darwin:8020
+
+ $ bin/hadoop dfs -conf hadoop-site.xml -ls /data
+ list /data directory in dfs with conf specified in hadoop-site.xml
+
+ $ bin/hadoop job -D mapred.job.tracker=darwin:50020 -submit job.xml
+ submit a job to job tracker darwin:50020
+
+ $ bin/hadoop job -jt darwin:50020 -submit job.xml
+ submit a job to job tracker darwin:50020
+
+ $ bin/hadoop job -jt local -submit job.xml
+ submit a job to local runner
+
+ $ bin/hadoop jar -libjars testlib.jar
+ -archives test.tgz -files file.txt inputjar args
+ job submission with libjars, files and archives
+
+
+ @see Tool
+ @see ToolRunner]]>
+
+
+
+
+
+
+
+
+
+
+ Class<T>) of the
+ argument of type T.
+ @param The type of the argument
+ @param t the object to get it class
+ @return Class<T>]]>
+
+
+
+
+
+
+ List<T> to a an array of
+ T[].
+ @param c the Class object of the items in the list
+ @param list the list to convert]]>
+
+
+
+
+
+ List<T> to a an array of
+ T[].
+ @param list the list to convert
+ @throws ArrayIndexOutOfBoundsException if the list is empty.
+ Use {@link #toArray(Class, List)} if the list may be empty.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ io.file.buffer.size specified in the given
+ Configuration.
+ @param in input stream
+ @param conf configuration
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if native-hadoop is loaded,
+ else false]]>
+
+
+
+
+
+ true if native hadoop libraries, if present, can be
+ used for this job; false otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ { pq.top().change(); pq.adjustTop(); }
+ instead of
+ { o = pq.pop(); o.change(); pq.push(o); }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Clients and/or applications can use the provided Progressable
+ to explicitly report progress to the Hadoop framework. This is especially
+ important for operations which take an insignificant amount of time since,
+ in-lieu of the reported progress, the framework has to assume that an error
+ has occured and time-out the operation.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Class is to be obtained
+ @return the correctly typed Class of the given object.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Hadoop Pipes
+ or Hadoop Streaming.
+
+ It also checks to ensure that we are running on a *nix platform else
+ (e.g. in Cygwin/Windows) it returns null.
+ @param conf configuration
+ @return a String[] with the ulimit command arguments or
+ null if we are running on a non *nix platform or
+ if the limit is unspecified.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Shell interface.
+ @param cmd shell command to execute.
+ @return the output of the executed command.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Shell can be used to run unix commands like du or
+ df. It also offers facilities to gate commands by
+ time-intervals.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ShellCommandExecutorshould be used in cases where the output
+ of the command needs no explicit parsing and where the command, working
+ directory and the environment remains unchanged. The output of the command
+ is stored as-is and is expected to be small.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ArrayList of string values]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ charToEscape in the string
+ with the escape char escapeChar
+
+ @param str string
+ @param escapeChar escape char
+ @param charToEscape the char to be escaped
+ @return an escaped string]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ charToEscape in the string
+ with the escape char escapeChar
+
+ @param str string
+ @param escapeChar escape char
+ @param charToEscape the escaped char
+ @return an unescaped string]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Tool, is the standard for any Map-Reduce tool/application.
+ The tool/application should delegate the handling of
+
+ standard command-line options to {@link ToolRunner#run(Tool, String[])}
+ and only handle its custom arguments.
+
+
Here is how a typical Tool is implemented:
+
+ public class MyApp extends Configured implements Tool {
+
+ public int run(String[] args) throws Exception {
+ // Configuration processed by ToolRunner
+ Configuration conf = getConf();
+
+ // Create a JobConf using the processed conf
+ JobConf job = new JobConf(conf, MyApp.class);
+
+ // Process custom command-line options
+ Path in = new Path(args[1]);
+ Path out = new Path(args[2]);
+
+ // Specify various job-specific parameters
+ job.setJobName("my-app");
+ job.setInputPath(in);
+ job.setOutputPath(out);
+ job.setMapperClass(MyApp.MyMapper.class);
+ job.setReducerClass(MyApp.MyReducer.class);
+
+ // Submit the job, then poll for progress until the job is complete
+ JobClient.runJob(job);
+ }
+
+ public static void main(String[] args) throws Exception {
+ // Let ToolRunner handle generic command-line options
+ int res = ToolRunner.run(new Configuration(), new Sort(), args);
+
+ System.exit(res);
+ }
+ }
+
+
+ @see GenericOptionsParser
+ @see ToolRunner]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Tool by {@link Tool#run(String[])}, after
+ parsing with the given generic arguments. Uses the given
+ Configuration, or builds one if null.
+
+ Sets the Tool's configuration with the possibly modified
+ version of the conf.
+
+ @param conf Configuration for the Tool.
+ @param tool Tool to run.
+ @param args command-line arguments to the tool.
+ @return exit code of the {@link Tool#run(String[])} method.]]>
+
+
+
+
+
+
+
+ Tool with its Configuration.
+
+ Equivalent to run(tool.getConf(), tool, args).
+
+ @param tool Tool to run.
+ @param args command-line arguments to the tool.
+ @return exit code of the {@link Tool#run(String[])} method.]]>
+
+
+
+
+
+
+
+
+
+ ToolRunner can be used to run classes implementing
+ Tool interface. It works in conjunction with
+ {@link GenericOptionsParser} to parse the
+
+ generic hadoop command line arguments and modifies the
+ Configuration of the Tool. The
+ application-specific options are passed along without being modified.
+
+
+ @see Tool
+ @see GenericOptionsParser]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/x86_64/share/hadoop/common/jdiff/hadoop_0.19.1.xml b/x86_64/share/hadoop/common/jdiff/hadoop_0.19.1.xml
new file mode 100644
index 0000000..92bdd2c
--- /dev/null
+++ b/x86_64/share/hadoop/common/jdiff/hadoop_0.19.1.xml
@@ -0,0 +1,44195 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ final.
+
+ @param name resource to be added, the classpath is examined for a file
+ with that name.]]>
+
+
+
+
+
+ final.
+
+ @param url url of the resource to be added, the local filesystem is
+ examined directly to find the resource, without referring to
+ the classpath.]]>
+
+
+
+
+
+ final.
+
+ @param file file-path of resource to be added, the local filesystem is
+ examined directly to find the resource, without referring to
+ the classpath.]]>
+
+
+
+
+
+ final.
+
+ @param in InputStream to deserialize the object from.]]>
+
+
+
+
+
+
+
+
+
+
+ name property, null if
+ no such property exists.
+
+ Values are processed for variable expansion
+ before being returned.
+
+ @param name the property name.
+ @return the value of the name property,
+ or null if no such property exists.]]>
+
+
+
+
+
+ name property, without doing
+ variable expansion.
+
+ @param name the property name.
+ @return the value of the name property,
+ or null if no such property exists.]]>
+
+
+
+
+
+
+ value of the name property.
+
+ @param name property name.
+ @param value property value.]]>
+
+
+
+
+
+
+ name property. If no such property
+ exists, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value, or defaultValue if the property
+ doesn't exist.]]>
+
+
+
+
+
+
+ name property as an int.
+
+ If no such property exists, or if the specified value is not a valid
+ int, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as an int,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to an int.
+
+ @param name property name.
+ @param value int value of the property.]]>
+
+
+
+
+
+
+ name property as a long.
+ If no such property is specified, or if the specified value is not a valid
+ long, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as a long,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to a long.
+
+ @param name property name.
+ @param value long value of the property.]]>
+
+
+
+
+
+
+ name property as a float.
+ If no such property is specified, or if the specified value is not a valid
+ float, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as a float,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property as a boolean.
+ If no such property is specified, or if the specified value is not a valid
+ boolean, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as a boolean,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to a boolean.
+
+ @param name property name.
+ @param value boolean value of the property.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ name property as
+ a collection of Strings.
+ If no such property is specified then empty collection is returned.
+
+ This is an optimized version of {@link #getStrings(String)}
+
+ @param name property name.
+ @return property value as a collection of Strings.]]>
+
+
+
+
+
+ name property as
+ an array of Strings.
+ If no such property is specified then null is returned.
+
+ @param name property name.
+ @return property value as an array of Strings,
+ or null.]]>
+
+
+
+
+
+
+ name property as
+ an array of Strings.
+ If no such property is specified then default value is returned.
+
+ @param name property name.
+ @param defaultValue The default value
+ @return property value as an array of Strings,
+ or default value.]]>
+
+
+
+
+
+
+ name property as
+ as comma delimited values.
+
+ @param name property name.
+ @param values The values]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ name property
+ as an array of Class.
+ The value of the property specifies a list of comma separated class names.
+ If no such property is specified, then defaultValue is
+ returned.
+
+ @param name the property name.
+ @param defaultValue default value.
+ @return property value as a Class[],
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property as a Class.
+ If no such property is specified, then defaultValue is
+ returned.
+
+ @param name the class name.
+ @param defaultValue default value.
+ @return property value as a Class,
+ or defaultValue.]]>
+
+
+
+
+
+
+
+ name property as a Class
+ implementing the interface specified by xface.
+
+ If no such property is specified, then defaultValue is
+ returned.
+
+ An exception is thrown if the returned class does not implement the named
+ interface.
+
+ @param name the class name.
+ @param defaultValue default value.
+ @param xface the interface implemented by the named class.
+ @return property value as a Class,
+ or defaultValue.]]>
+
+
+
+
+
+
+
+ name property to the name of a
+ theClass implementing the given interface xface.
+
+ An exception is thrown if theClass does not implement the
+ interface xface.
+
+ @param name property name.
+ @param theClass property value.
+ @param xface the interface implemented by the named class.]]>
+
+
+
+
+
+
+
+ dirsProp with
+ the given path. If dirsProp contains multiple directories,
+ then one is chosen based on path's hash code. If the selected
+ directory does not exist, an attempt is made to create it.
+
+ @param dirsProp directory in which to locate the file.
+ @param path file-path.
+ @return local file under the directory with the given path.]]>
+
+
+
+
+
+
+
+ dirsProp with
+ the given path. If dirsProp contains multiple directories,
+ then one is chosen based on path's hash code. If the selected
+ directory does not exist, an attempt is made to create it.
+
+ @param dirsProp directory in which to locate the file.
+ @param path file-path.
+ @return local file under the directory with the given path.]]>
+
+
+
+
+
+
+
+
+
+
+
+ name.
+
+ @param name configuration resource name.
+ @return an input stream attached to the resource.]]>
+
+
+
+
+
+ name.
+
+ @param name configuration resource name.
+ @return a reader attached to the resource.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ String
+ key-value pairs in the configuration.
+
+ @return an iterator over the entries.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true to set quiet-mode on, false
+ to turn it off.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Resources
+
+
Configurations are specified by resources. A resource contains a set of
+ name/value pairs as XML data. Each resource is named by either a
+ String or by a {@link Path}. If named by a String,
+ then the classpath is examined for a file with that name. If named by a
+ Path, then the local filesystem is examined directly, without
+ referring to the classpath.
+
+
Unless explicitly turned off, Hadoop by default specifies two
+ resources, loaded in-order from the classpath:
hadoop-site.xml: Site-specific configuration for a given hadoop
+ installation.
+
+ Applications may add additional resources, which are loaded
+ subsequent to these resources in the order they are added.
+
+
Final Parameters
+
+
Configuration parameters may be declared final.
+ Once a resource declares a value final, no subsequently-loaded
+ resource can alter that value.
+ For example, one might define a final parameter with:
+
+
+ When conf.get("tempdir") is called, then ${basedir}
+ will be resolved to another property in this Configuration, while
+ ${user.name} would then ordinarily be resolved to the value
+ of the System property with that name.]]>
+
Applications specify the files, via urls (hdfs:// or http://) to be cached
+ via the {@link org.apache.hadoop.mapred.JobConf}.
+ The DistributedCache assumes that the
+ files specified via hdfs:// urls are already present on the
+ {@link FileSystem} at the path specified by the url.
+
+
The framework will copy the necessary files on to the slave node before
+ any tasks for the job are executed on that node. Its efficiency stems from
+ the fact that the files are only copied once per job and the ability to
+ cache archives which are un-archived on the slaves.
+
+
DistributedCache can be used to distribute simple, read-only
+ data/text files and/or more complex types such as archives, jars etc.
+ Archives (zip, tar and tgz/tar.gz files) are un-archived at the slave nodes.
+ Jars may be optionally added to the classpath of the tasks, a rudimentary
+ software distribution mechanism. Files have execution permissions.
+ Optionally users can also direct it to symlink the distributed cache file(s)
+ into the working directory of the task.
+
+
DistributedCache tracks modification timestamps of the cache
+ files. Clearly the cache files should not be modified by the application
+ or externally while the job is executing.
+
+
Here is an illustrative example on how to use the
+ DistributedCache:
+
+ // Setting up the cache for the application
+
+ 1. Copy the requisite files to the FileSystem:
+
+ $ bin/hadoop fs -copyFromLocal lookup.dat /myapp/lookup.dat
+ $ bin/hadoop fs -copyFromLocal map.zip /myapp/map.zip
+ $ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar
+ $ bin/hadoop fs -copyFromLocal mytar.tar /myapp/mytar.tar
+ $ bin/hadoop fs -copyFromLocal mytgz.tgz /myapp/mytgz.tgz
+ $ bin/hadoop fs -copyFromLocal mytargz.tar.gz /myapp/mytargz.tar.gz
+
+ 2. Setup the application's JobConf:
+
+ JobConf job = new JobConf();
+ DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"),
+ job);
+ DistributedCache.addCacheArchive(new URI("/myapp/map.zip", job);
+ DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
+ DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar", job);
+ DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz", job);
+ DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz", job);
+
+ 3. Use the cached files in the {@link org.apache.hadoop.mapred.Mapper}
+ or {@link org.apache.hadoop.mapred.Reducer}:
+
+ public static class MapClass extends MapReduceBase
+ implements Mapper<K, V, K, V> {
+
+ private Path[] localArchives;
+ private Path[] localFiles;
+
+ public void configure(JobConf job) {
+ // Get the cached archives/files
+ localArchives = DistributedCache.getLocalCacheArchives(job);
+ localFiles = DistributedCache.getLocalCacheFiles(job);
+ }
+
+ public void map(K key, V value,
+ OutputCollector<K, V> output, Reporter reporter)
+ throws IOException {
+ // Use data from the cached archives/files here
+ // ...
+ // ...
+ output.collect(k, v);
+ }
+ }
+
+
+ A filename pattern is composed of regular characters and
+ special pattern matching characters, which are:
+
+
+
+
+
+
?
+
Matches any single character.
+
+
+
*
+
Matches zero or more characters.
+
+
+
[abc]
+
Matches a single character from character set
+ {a,b,c}.
+
+
+
[a-b]
+
Matches a single character from the character range
+ {a...b}. Note that character a must be
+ lexicographically less than or equal to character b.
+
+
+
[^a]
+
Matches a single character that is not from character set or range
+ {a}. Note that the ^ character must occur
+ immediately to the right of the opening bracket.
+
+
+
\c
+
Removes (escapes) any special meaning of character c.
+
+
+
{ab,cd}
+
Matches a string from the string set {ab, cd}
+
+
+
{ab,c{de,fh}}
+
Matches a string from the string set {ab, cde, cfh}
+
+
+
+
+
+ @param pathPattern a regular expression specifying a pth pattern
+
+ @return an array of paths that match the path pattern
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ All user code that may potentially use the Hadoop Distributed
+ File System should be written to use a FileSystem object. The
+ Hadoop DFS is a multi-machine system that appears as a single
+ disk. It's useful because of its fault tolerance and potentially
+ very large capacity.
+
+
+ The local implementation is {@link LocalFileSystem} and distributed
+ implementation is DistributedFileSystem.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FilterFileSystem contains
+ some other file system, which it uses as
+ its basic file system, possibly transforming
+ the data along the way or providing additional
+ functionality. The class FilterFileSystem
+ itself simply overrides all methods of
+ FileSystem with versions that
+ pass all requests to the contained file
+ system. Subclasses of FilterFileSystem
+ may further override some of these methods
+ and may also provide additional methods
+ and fields.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ buf at offset
+ and checksum into checksum.
+ The method is used for implementing read, therefore, it should be optimized
+ for sequential reading
+ @param pos chunkPos
+ @param buf desitination buffer
+ @param offset offset in buf at which to store data
+ @param len maximun number of bytes to read
+ @return number of bytes read]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ -1 if the end of the
+ stream is reached.
+ @exception IOException if an I/O error occurs.]]>
+
+
+
+
+
+
+
+
+ This method implements the general contract of the corresponding
+ {@link InputStream#read(byte[], int, int) read} method of
+ the {@link InputStream} class. As an additional
+ convenience, it attempts to read as many bytes as possible by repeatedly
+ invoking the read method of the underlying stream. This
+ iterated read continues until one of the following
+ conditions becomes true:
+
+
The specified number of bytes have been read,
+
+
The read method of the underlying stream returns
+ -1, indicating end-of-file.
+
+
If the first read on the underlying stream returns
+ -1 to indicate end-of-file then this method returns
+ -1. Otherwise this method returns the number of bytes
+ actually read.
+
+ @param b destination buffer.
+ @param off offset at which to start storing bytes.
+ @param len maximum number of bytes to read.
+ @return the number of bytes read, or -1 if the end of
+ the stream has been reached.
+ @exception IOException if an I/O error occurs.
+ ChecksumException if any checksum error occurs]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ n bytes of data from the
+ input stream.
+
+
This method may skip more bytes than are remaining in the backing
+ file. This produces no exception and the number of bytes skipped
+ may include some number of bytes that were beyond the EOF of the
+ backing file. Attempting to read from the stream after skipping past
+ the end will result in -1 indicating the end of the file.
+
+
If n is negative, no bytes are skipped.
+
+ @param n the number of bytes to be skipped.
+ @return the actual number of bytes skipped.
+ @exception IOException if an I/O error occurs.
+ ChecksumException if the chunk to skip to is corrupted]]>
+
+
+
+
+
+
+ This method may seek past the end of the file.
+ This produces no exception and an attempt to read from
+ the stream will result in -1 indicating the end of the file.
+
+ @param pos the postion to seek to.
+ @exception IOException if an I/O error occurs.
+ ChecksumException if the chunk to seek to is corrupted]]>
+
+
+
+
+
+
+
+
+
+ len bytes from
+ stm
+
+ @param stm an input stream
+ @param buf destiniation buffer
+ @param offset offset at which to store data
+ @param len number of bytes to read
+ @return actual number of bytes read
+ @throws IOException if there is any IO error]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ len bytes from the specified byte array
+ starting at offset off and generate a checksum for
+ each data chunk.
+
+
This method stores bytes from the given array into this
+ stream's buffer before it gets checksumed. The buffer gets checksumed
+ and flushed to the underlying output stream when all data
+ in a checksum chunk are in the buffer. If the buffer is empty and
+ requested length is at least as large as the size of next checksum chunk
+ size, this method will checksum and write the chunk directly
+ to the underlying output stream. Thus it avoids uneccessary data copy.
+
+ @param b the data.
+ @param off the start offset in the data.
+ @param len the number of bytes to write.
+ @exception IOException if an I/O error occurs.]]>
+
+
+ DataInputBuffer buffer = new DataInputBuffer();
+ while (... loop condition ...) {
+ byte[] data = ... get data ...;
+ int dataLength = ... get data length ...;
+ buffer.reset(data, dataLength);
+ ... read buffer using DataInput methods ...
+ }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This saves memory over creating a new DataOutputStream and
+ ByteArrayOutputStream each time data is written.
+
+
Typical usage is something like the following:
+
+ DataOutputBuffer buffer = new DataOutputBuffer();
+ while (... loop condition ...) {
+ buffer.reset();
+ ... write buffer using DataOutput methods ...
+ byte[] data = buffer.getData();
+ int dataLength = buffer.getLength();
+ ... write data to its ultimate destination ...
+ }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to store
+ @param item the object to be stored
+ @param keyName the name of the key to use
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param keyName the name of the key to use
+ @param itemClass the class of the item
+ @return restored object
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param items the objects to be stored
+ @param keyName the name of the key to use
+ @throws IndexOutOfBoundsException if the items array is empty
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param keyName the name of the key to use
+ @param itemClass the class of the item
+ @return restored object
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+ DefaultStringifier offers convenience methods to store/load objects to/from
+ the configuration.
+
+ @param the class of the objects to stringify]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a DoubleWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a FloatWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ When two sequence files, which have same Key type but different Value
+ types, are mapped out to reduce, multiple Value types is not allowed.
+ In this case, this class can help you wrap instances with different types.
+
+
+
+ Compared with ObjectWritable, this class is much more effective,
+ because ObjectWritable will append the class declaration as a String
+ into the output file in every Key-Value pair.
+
+
+
+ Generic Writable implements {@link Configurable} interface, so that it will be
+ configured by the framework. The configuration is passed to the wrapped objects
+ implementing {@link Configurable} interface before deserialization.
+
+
+ how to use it:
+ 1. Write your own class, such as GenericObject, which extends GenericWritable.
+ 2. Implements the abstract method getTypes(), defines
+ the classes which will be wrapped in GenericObject in application.
+ Attention: this classes defined in getTypes() method, must
+ implement Writable interface.
+
+
+ @since Nov 8, 2006]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This saves memory over creating a new InputStream and
+ ByteArrayInputStream each time data is read.
+
+
Typical usage is something like the following:
+
+ InputBuffer buffer = new InputBuffer();
+ while (... loop condition ...) {
+ byte[] data = ... get data ...;
+ int dataLength = ... get data length ...;
+ buffer.reset(data, dataLength);
+ ... read buffer using InputStream methods ...
+ }
+
+ @see DataInputBuffer
+ @see DataOutput]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a IntWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ closes the input and output streams
+ at the end.
+ @param in InputStrem to read from
+ @param out OutputStream to write to
+ @param conf the Configuration object]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ignore any {@link IOException} or
+ null pointers. Must only be used for cleanup in exception handlers.
+ @param log the log to record problems to at debug level. Can be null.
+ @param closeables the objects to close]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a LongWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A map is a directory containing two files, the data file,
+ containing all keys and values in the map, and a smaller index
+ file, containing a fraction of the keys. The fraction is determined by
+ {@link Writer#getIndexInterval()}.
+
+
The index file is read entirely into memory. Thus key implementations
+ should try to keep themselves small.
+
+
Map files are created by adding entries in-order. To maintain a large
+ database, perform updates by copying the previous version of a database and
+ merging in a sorted change list, to create a new version of the database in
+ a new file. Sorting large change lists can be done with {@link
+ SequenceFile.Sorter}.]]>
+
SequenceFile provides {@link Writer}, {@link Reader} and
+ {@link Sorter} classes for writing, reading and sorting respectively.
+
+ There are three SequenceFileWriters based on the
+ {@link CompressionType} used to compress key/value pairs:
+
+
+ Writer : Uncompressed records.
+
+
+ RecordCompressWriter : Record-compressed files, only compress
+ values.
+
+
+ BlockCompressWriter : Block-compressed files, both keys &
+ values are collected in 'blocks'
+ separately and compressed. The size of
+ the 'block' is configurable.
+
+
+
The actual compression algorithm used to compress key and/or values can be
+ specified by using the appropriate {@link CompressionCodec}.
+
+
The recommended way is to use the static createWriter methods
+ provided by the SequenceFile to chose the preferred format.
+
+
The {@link Reader} acts as the bridge and can read any of the above
+ SequenceFile formats.
+
+
SequenceFile Formats
+
+
Essentially there are 3 different formats for SequenceFiles
+ depending on the CompressionType specified. All of them share a
+ common header described below.
+
+
SequenceFile Header
+
+
+ version - 3 bytes of magic header SEQ, followed by 1 byte of actual
+ version number (e.g. SEQ4 or SEQ6)
+
+
+ keyClassName -key class
+
+
+ valueClassName - value class
+
+
+ compression - A boolean which specifies if compression is turned on for
+ keys/values in this file.
+
+
+ blockCompression - A boolean which specifies if block-compression is
+ turned on for keys/values in this file.
+
+
+ compression codec - CompressionCodec class which is used for
+ compression of keys and/or values (if compression is
+ enabled).
+
+
+ metadata - {@link Metadata} for this file.
+
+
+ sync - A sync marker to denote end of the header.
+
The compressed blocks of key lengths and value lengths consist of the
+ actual lengths of individual keys/values encoded in ZeroCompressedInteger
+ format.
+
+ @see CompressionCodec]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ key, skipping its
+ value. True if another entry exists, and false at end of file.]]>
+
+
+
+
+
+
+
+ key and
+ val. Returns true if such a pair exists and false when at
+ end of file]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The position passed must be a position returned by {@link
+ SequenceFile.Writer#getLength()} when writing this file. To seek to an arbitrary
+ position, use {@link SequenceFile.Reader#sync(long)}.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ SegmentDescriptor
+ @param segments the list of SegmentDescriptors
+ @param tmpDir the directory to write temporary files into
+ @return RawKeyValueIterator
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ For best performance, applications should make sure that the {@link
+ Writable#readFields(DataInput)} implementation of their keys is
+ very efficient. In particular, it should avoid allocating memory.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This always returns a synchronized position. In other words,
+ immediately after calling {@link SequenceFile.Reader#seek(long)} with a position
+ returned by this method, {@link SequenceFile.Reader#next(Writable)} may be called. However
+ the key may be earlier in the file than key last written when this
+ method was called (e.g., with block-compression, it may be the first key
+ in the block that was being written when this method was called).]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ key. Returns
+ true if such a key exists and false when at the end of the set.]]>
+
+
+
+
+
+
+ key.
+ Returns key, or null if no match exists.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the class of the objects to stringify]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ position. Note that this
+ method avoids using the converter or doing String instatiation
+ @return the Unicode scalar value at position or -1
+ if the position is invalid or points to a
+ trailing byte]]>
+
+
+
+
+
+
+
+
+
+ what in the backing
+ buffer, starting as position start. The starting
+ position is measured in bytes and the return value is in
+ terms of byte position in the buffer. The backing buffer is
+ not converted to a string for this operation.
+ @return byte position of the first occurence of the search
+ string in the UTF-8 buffer or -1 if not found]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a Text with the same contents.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ replace is true, then
+ malformed input is replaced with the
+ substitution character, which is U+FFFD. Otherwise the
+ method throws a MalformedInputException.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ replace is true, then
+ malformed input is replaced with the
+ substitution character, which is U+FFFD. Otherwise the
+ method throws a MalformedInputException.
+ @return ByteBuffer: bytes stores at ByteBuffer.array()
+ and length is ByteBuffer.limit()]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ In
+ addition, it provides methods for string traversal without converting the
+ byte array to a string.
Also includes utilities for
+ serializing/deserialing a string, coding/decoding a string, checking if a
+ byte array contains valid UTF8 code, calculating the length of an encoded
+ string.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a UTF8 with the same contents.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Also includes utilities for efficiently reading and writing UTF-8.
+
+ @deprecated replaced by Text]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This is useful when a class may evolve, so that instances written by the
+ old version of the class may still be processed by the new version. To
+ handle this situation, {@link #readFields(DataInput)}
+ implementations should catch {@link VersionMismatchException}.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a VIntWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a VLongWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ out.
+
+ @param out DataOuput to serialize this object into.
+ @throws IOException]]>
+
+
+
+
+
+
+ in.
+
+
For efficiency, implementations should attempt to re-use storage in the
+ existing object where possible.
+
+ @param in DataInput to deseriablize this object from.
+ @throws IOException]]>
+
+
+
+ Any key or value type in the Hadoop Map-Reduce
+ framework implements this interface.
+
+
Implementations typically implement a static read(DataInput)
+ method which constructs a new instance, calls {@link #readFields(DataInput)}
+ and returns the instance.
+
+
Example:
+
+ public class MyWritable implements Writable {
+ // Some data
+ private int counter;
+ private long timestamp;
+
+ public void write(DataOutput out) throws IOException {
+ out.writeInt(counter);
+ out.writeLong(timestamp);
+ }
+
+ public void readFields(DataInput in) throws IOException {
+ counter = in.readInt();
+ timestamp = in.readLong();
+ }
+
+ public static MyWritable read(DataInput in) throws IOException {
+ MyWritable w = new MyWritable();
+ w.readFields(in);
+ return w;
+ }
+ }
+
]]>
+
+
+
+
+
+
+
+
+ WritableComparables can be compared to each other, typically
+ via Comparators. Any type which is to be used as a
+ key in the Hadoop Map-Reduce framework should implement this
+ interface.
+
+
Example:
+
+ public class MyWritableComparable implements WritableComparable {
+ // Some data
+ private int counter;
+ private long timestamp;
+
+ public void write(DataOutput out) throws IOException {
+ out.writeInt(counter);
+ out.writeLong(timestamp);
+ }
+
+ public void readFields(DataInput in) throws IOException {
+ counter = in.readInt();
+ timestamp = in.readLong();
+ }
+
+ public int compareTo(MyWritableComparable w) {
+ int thisValue = this.value;
+ int thatValue = ((IntWritable)o).value;
+ return (thisValue < thatValue ? -1 : (thisValue==thatValue ? 0 : 1));
+ }
+ }
+
One may optimize compare-intensive operations by overriding
+ {@link #compare(byte[],int,int,byte[],int,int)}. Static utility methods are
+ provided to assist in optimized implementations of this method.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Enum type
+ @param in DataInput to read from
+ @param enumType Class type of Enum
+ @return Enum represented by String read from DataInput
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ len number of bytes in input streamin
+ @param in input stream
+ @param len number of bytes to skip
+ @throws IOException when skipped less number of bytes]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ CompressionCodec for which to get the
+ Compressor
+ @return Compressor for the given
+ CompressionCodec from the pool or a new one]]>
+
+
+
+
+
+ CompressionCodec for which to get the
+ Decompressor
+ @return Decompressor for the given
+ CompressionCodec the pool or a new one]]>
+
+
+
+
+
+ Compressor to be returned to the pool]]>
+
+
+
+
+
+ Decompressor to be returned to the
+ pool]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Implementations are assumed to be buffered. This permits clients to
+ reposition the underlying input stream then call {@link #resetState()},
+ without having to also synchronize client buffers.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true indicating that more input data is required.
+
+ @param b Input data
+ @param off Start offset
+ @param len Length]]>
+
+
+
+
+ true if the input data buffer is empty and
+ #setInput() should be called in order to provide more input.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the end of the compressed
+ data output stream has been reached.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true indicating that more input data is required.
+
+ @param b Input data
+ @param off Start offset
+ @param len Length]]>
+
+
+
+
+ true if the input data buffer is empty and
+ #setInput() should be called in order to provide more input.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ true if a preset dictionary is needed for decompression.
+ @return true if a preset dictionary is needed for decompression]]>
+
+
+
+
+ true if the end of the compressed
+ data output stream has been reached.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if native-lzo library is loaded & initialized;
+ else false]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ lzo compression/decompression pair.
+ http://www.oberhumer.com/opensource/lzo/]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ lzo compression/decompression pair compatible with lzop.
+ http://www.lzop.org/]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FIXME: This array should be in a private or package private location,
+ since it could be modified by malicious code.
+ ]]>
+
+
+
+
+ This interface is public for historical purposes. You should have no need to
+ use it.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Although BZip2 headers are marked with the magic "Bz" this
+ constructor expects the next byte in the stream to be the first one after
+ the magic. Thus callers have to skip the first two bytes. Otherwise this
+ constructor will throw an exception.
+
+
+ @throws IOException
+ if the stream content is malformed or an I/O error occurs.
+ @throws NullPointerException
+ if in == null]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The decompression requires large amounts of memory. Thus you should call the
+ {@link #close() close()} method as soon as possible, to force
+ CBZip2InputStream to release the allocated memory. See
+ {@link CBZip2OutputStream CBZip2OutputStream} for information about memory
+ usage.
+
+
+
+ CBZip2InputStream reads bytes from the compressed source stream via
+ the single byte {@link java.io.InputStream#read() read()} method exclusively.
+ Thus you should consider to use a buffered source stream.
+
+
+
+ Instances of this class are not threadsafe.
+
]]>
+
+
+
+
+
+
+
+
+
+ CBZip2OutputStream with a blocksize of 900k.
+
+
+ Attention: The caller is resonsible to write the two BZip2 magic
+ bytes "BZ" to the specified stream prior to calling this
+ constructor.
+
+
+ @param out *
+ the destination stream.
+
+ @throws IOException
+ if an I/O error occurs in the specified stream.
+ @throws NullPointerException
+ if out == null.]]>
+
+
+
+
+
+ CBZip2OutputStream with specified blocksize.
+
+
+ Attention: The caller is resonsible to write the two BZip2 magic
+ bytes "BZ" to the specified stream prior to calling this
+ constructor.
+
+
+
+ @param out
+ the destination stream.
+ @param blockSize
+ the blockSize as 100k units.
+
+ @throws IOException
+ if an I/O error occurs in the specified stream.
+ @throws IllegalArgumentException
+ if (blockSize < 1) || (blockSize > 9).
+ @throws NullPointerException
+ if out == null.
+
+ @see #MIN_BLOCKSIZE
+ @see #MAX_BLOCKSIZE]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ inputLength this method returns MAX_BLOCKSIZE
+ always.
+
+ @param inputLength
+ The length of the data which will be compressed by
+ CBZip2OutputStream.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ == 1.]]>
+
+
+
+
+ == 9.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ If you are ever unlucky/improbable enough to get a stack overflow whilst
+ sorting, increase the following constant and try again. In practice I
+ have never seen the stack go above 27 elems, so the following limit seems
+ very generous.
+ ]]>
+
+
+
+
+ The compression requires large amounts of memory. Thus you should call the
+ {@link #close() close()} method as soon as possible, to force
+ CBZip2OutputStream to release the allocated memory.
+
+
+
+ You can shrink the amount of allocated memory and maybe raise the compression
+ speed by choosing a lower blocksize, which in turn may cause a lower
+ compression ratio. You can avoid unnecessary memory allocation by avoiding
+ using a blocksize which is bigger than the size of the input.
+
+
+
+ You can compute the memory usage for compressing by the following formula:
+
+
+
+ <code>400k + (9 * blocksize)</code>.
+
+
+
+ To get the memory required for decompression by {@link CBZip2InputStream
+ CBZip2InputStream} use
+
+
+
+ <code>65k + (5 * blocksize)</code>.
+
+
+
+
+
+
+
Memory usage by blocksize
+
+
+
Blocksize
Compression
+ memory usage
Decompression
+ memory usage
+
+
+
100k
+
1300k
+
565k
+
+
+
200k
+
2200k
+
1065k
+
+
+
300k
+
3100k
+
1565k
+
+
+
400k
+
4000k
+
2065k
+
+
+
500k
+
4900k
+
2565k
+
+
+
600k
+
5800k
+
3065k
+
+
+
700k
+
6700k
+
3565k
+
+
+
800k
+
7600k
+
4065k
+
+
+
900k
+
8500k
+
4565k
+
+
+
+
+ For decompression CBZip2InputStream allocates less memory if the
+ bzipped input is smaller than one block.
+
+
+
+ Instances of this class are not threadsafe.
+
+
+
+ TODO: Update to BZip2 1.0.1
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if lzo compressors are loaded & initialized,
+ else false]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if lzo decompressors are loaded & initialized,
+ else false]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ @return the total (non-negative) number of uncompressed bytes input so far]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ @return the total (non-negative) number of uncompressed bytes input so far]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if native-zlib is loaded & initialized
+ and can be loaded for this job, else false]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Keep trying a limited number of times, waiting a fixed time between attempts,
+ and then fail by re-throwing the exception.
+ ]]>
+
+
+
+
+
+
+
+
+ Keep trying for a maximum time, waiting a fixed time between attempts,
+ and then fail by re-throwing the exception.
+ ]]>
+
+
+
+
+
+
+
+
+ Keep trying a limited number of times, waiting a growing amount of time between attempts,
+ and then fail by re-throwing the exception.
+ The time between attempts is sleepTime mutliplied by the number of tries so far.
+ ]]>
+
+
+
+
+
+
+
+
+ Keep trying a limited number of times, waiting a growing amount of time between attempts,
+ and then fail by re-throwing the exception.
+ The time between attempts is sleepTime mutliplied by a random
+ number in the range of [0, 2 to the number of retries)
+ ]]>
+
+
+
+
+
+
+
+ Set a default policy with some explicit handlers for specific exceptions.
+ ]]>
+
+
+
+
+
+
+
+ A retry policy for RemoteException
+ Set a default policy with some explicit handlers for specific exceptions.
+ ]]>
+
+
+
+
+
+ Try once, and fail by re-throwing the exception.
+ This corresponds to having no retry mechanism in place.
+ ]]>
+
+
+
+
+
+ Try once, and fail silently for void methods, or by
+ re-throwing the exception for non-void methods.
+ ]]>
+
+
+
+
+
+ Keep trying forever.
+ ]]>
+
+
+
+
+ A collection of useful implementations of {@link RetryPolicy}.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+ Determines whether the framework should retry a
+ method for the given exception, and the number
+ of retries that have been made for that operation
+ so far.
+
+ @param e The exception that caused the method to fail.
+ @param retries The number of times the method has been retried.
+ @return true if the method should be retried,
+ false if the method should not be retried
+ but shouldn't fail with an exception (only for void methods).
+ @throws Exception The re-thrown exception e indicating
+ that the method failed and should not be retried further.]]>
+
+
+
+
+ Specifies a policy for retrying method failures.
+ Implementations of this interface should be immutable.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Create a proxy for an interface of an implementation class
+ using the same retry policy for each method in the interface.
+
+ @param iface the interface that the retry will implement
+ @param implementation the instance whose methods should be retried
+ @param retryPolicy the policy for retirying method call failures
+ @return the retry proxy]]>
+
+
+
+
+
+
+
+
+ Create a proxy for an interface of an implementation class
+ using the a set of retry policies specified by method name.
+ If no retry policy is defined for a method then a default of
+ {@link RetryPolicies#TRY_ONCE_THEN_FAIL} is used.
+
+ @param iface the interface that the retry will implement
+ @param implementation the instance whose methods should be retried
+ @param methodNameToPolicyMap a map of method names to retry policies
+ @return the retry proxy]]>
+
+
+
+
+ A factory for creating retry proxies.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+ Prepare the deserializer for reading.]]>
+
+
+
+
+
+
+
+ Deserialize the next object from the underlying input stream.
+ If the object t is non-null then this deserializer
+ may set its internal state to the next object read from the input
+ stream. Otherwise, if the object t is null a new
+ deserialized object will be created.
+
+ @return the deserialized object]]>
+
+
+
+
+
+ Close the underlying input stream and clear up any resources.]]>
+
+
+
+
+ Provides a facility for deserializing objects of type from an
+ {@link InputStream}.
+
+
+
+ Deserializers are stateful, but must not buffer the input since
+ other producers may read from the input between calls to
+ {@link #deserialize(Object)}.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A {@link RawComparator} that uses a {@link Deserializer} to deserialize
+ the objects to be compared so that the standard {@link Comparator} can
+ be used to compare them.
+
+
+ One may optimize compare-intensive operations by using a custom
+ implementation of {@link RawComparator} that operates directly
+ on byte representations.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ An experimental {@link Serialization} for Java {@link Serializable} classes.
+
+ @see JavaSerializationComparator]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A {@link RawComparator} that uses a {@link JavaSerialization}
+ {@link Deserializer} to deserialize objects that are then compared via
+ their {@link Comparable} interfaces.
+
+ @param
+ @see JavaSerialization]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Encapsulates a {@link Serializer}/{@link Deserializer} pair.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+ Serializations are found by reading the io.serializations
+ property from conf, which is a comma-delimited list of
+ classnames.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A factory for {@link Serialization}s.
+ ]]>
+
+
+
+
+
+
+
+
+
+ Prepare the serializer for writing.]]>
+
+
+
+
+
+
+ Serialize t to the underlying output stream.]]>
+
+
+
+
+
+ Close the underlying output stream and clear up any resources.]]>
+
+
+
+
+ Provides a facility for serializing objects of type to an
+ {@link OutputStream}.
+
+
+
+ Serializers are stateful, but must not buffer the output since
+ other producers may write to the output between calls to
+ {@link #serialize(Object)}.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ param, to the IPC server running at
+ address, returning the value. Throws exceptions if there are
+ network problems or if the remote code threw an exception.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Unwraps any IOException.
+
+ @param lookupTypes the desired exception class.
+ @return IOException, which is either the lookupClass exception or this.]]>
+
+
+
+
+ This unwraps any Throwable that has a constructor taking
+ a String as a parameter.
+ Otherwise it returns this.
+
+ @return Throwable]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ protocol is a Java interface. All parameters and return types must
+ be one of:
+
+
a primitive type, boolean, byte,
+ char, short, int, long,
+ float, double, or void; or
+
+
a {@link String}; or
+
+
a {@link Writable}; or
+
+
an array of the above types
+
+ All methods in the protocol should throw only IOException. No field data of
+ the protocol instance is transmitted.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ handlerCount determines
+ the number of handler threads that will be used to process calls.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class has a number of metrics variables that are publicly accessible;
+ these variables (objects) have methods to update their values;
+ for example:
+
{@link #rpcQueueTime}.inc(time)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ For the statistics that are sampled and averaged, one must specify
+ a metrics context that does periodic update calls. Most do.
+ The default Null metrics context however does NOT. So if you aren't
+ using any other metrics context then you can turn on the viewing and averaging
+ of sampled metrics by specifying the following two lines
+ in the hadoop-meterics.properties file:
+
+ Note that the metrics are collected regardless of the context used.
+ The context with the update thread is used to average the data periodically]]>
+
Grouphandles localization of the class name and the
+ counter names.
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FileInputFormat implementations can override this and return
+ false to ensure that individual input files are never split-up
+ so that {@link Mapper}s process entire files.
+
+ @param fs the file system that the file is on
+ @param filename the file name to check
+ @return is this file splitable?]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FileInputFormat is the base class for all file-based
+ InputFormats. This provides a generic implementation of
+ {@link #getSplits(JobConf, int)}.
+ Subclasses of FileInputFormat can also override the
+ {@link #isSplitable(FileSystem, Path)} method to ensure input-files are
+ not split-up and are processed as a whole by {@link Mapper}s.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the job output should be compressed,
+ false otherwise]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Tasks' Side-Effect Files
+
+
Note: The following is valid only if the {@link OutputCommitter}
+ is {@link FileOutputCommitter}. If OutputCommitter is not
+ a FileOutputCommitter, the task's temporary output
+ directory is same as {@link #getOutputPath(JobConf)} i.e.
+ ${mapred.output.dir}$
+
+
Some applications need to create/write-to side-files, which differ from
+ the actual job-outputs.
+
+
In such cases there could be issues with 2 instances of the same TIP
+ (running simultaneously e.g. speculative tasks) trying to open/write-to the
+ same file (path) on HDFS. Hence the application-writer will have to pick
+ unique names per task-attempt (e.g. using the attemptid, say
+ attempt_200709221812_0001_m_000000_0), not just per TIP.
+
+
To get around this the Map-Reduce framework helps the application-writer
+ out by maintaining a special
+ ${mapred.output.dir}/_temporary/_${taskid}
+ sub-directory for each task-attempt on HDFS where the output of the
+ task-attempt goes. On successful completion of the task-attempt the files
+ in the ${mapred.output.dir}/_temporary/_${taskid} (only)
+ are promoted to ${mapred.output.dir}. Of course, the
+ framework discards the sub-directory of unsuccessful task-attempts. This
+ is completely transparent to the application.
+
+
The application-writer can take advantage of this by creating any
+ side-files required in ${mapred.work.output.dir} during execution
+ of his reduce-task i.e. via {@link #getWorkOutputPath(JobConf)}, and the
+ framework will move them out similarly - thus she doesn't have to pick
+ unique paths per task-attempt.
+
+
Note: the value of ${mapred.work.output.dir} during
+ execution of a particular task-attempt is actually
+ ${mapred.output.dir}/_temporary/_{$taskid}, and this value is
+ set by the map-reduce framework. So, just create any side-files in the
+ path returned by {@link #getWorkOutputPath(JobConf)} from map/reduce
+ task to take advantage of this feature.
+
+
The entire discussion holds true for maps of jobs with
+ reducer=NONE (i.e. 0 reduces) since output of the map, in that case,
+ goes directly to HDFS.
+
+ @return the {@link Path} to the task's temporary output directory
+ for the map-reduce job.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The generated name can be used to create custom files from within the
+ different tasks for the job, the names for different tasks will not collide
+ with each other.
+
+
The given name is postfixed with the task type, 'm' for maps, 'r' for
+ reduces and the task partition number. For example, give a name 'test'
+ running on the first map o the job the generated name will be
+ 'test-m-00000'.
+
+ @param conf the configuration for the job.
+ @param name the name to make unique.
+ @return a unique name accross all tasks of the job.]]>
+
+
+
+
+
+
+ The path can be used to create custom files from within the map and
+ reduce tasks. The path name will be unique for each task. The path parent
+ will be the job output directory.ls
+
+
This method uses the {@link #getUniqueName} method to make the file name
+ unique for the task.
+
+ @param conf the configuration for the job.
+ @param name the name for the file.
+ @return a unique path accross all tasks of the job.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Each {@link InputSplit} is then assigned to an individual {@link Mapper}
+ for processing.
+
+
Note: The split is a logical split of the inputs and the
+ input files are not physically split into chunks. For e.g. a split could
+ be <input-file-path, start, offset> tuple.
+
+ @param job job configuration.
+ @param numSplits the desired number of splits, a hint.
+ @return an array of {@link InputSplit}s for the job.]]>
+
+
+
+
+
+
+
+
+ It is the responsibility of the RecordReader to respect
+ record boundaries while processing the logical split to present a
+ record-oriented view to the individual task.
+
+ @param split the {@link InputSplit}
+ @param job the job that this split belongs to
+ @return a {@link RecordReader}]]>
+
+
+
+ InputFormat describes the input-specification for a
+ Map-Reduce job.
+
+
The Map-Reduce framework relies on the InputFormat of the
+ job to:
+
+
+ Validate the input-specification of the job.
+
+ Split-up the input file(s) into logical {@link InputSplit}s, each of
+ which is then assigned to an individual {@link Mapper}.
+
+
+ Provide the {@link RecordReader} implementation to be used to glean
+ input records from the logical InputSplit for processing by
+ the {@link Mapper}.
+
+
+
+
The default behavior of file-based {@link InputFormat}s, typically
+ sub-classes of {@link FileInputFormat}, is to split the
+ input into logical {@link InputSplit}s based on the total size, in
+ bytes, of the input files. However, the {@link FileSystem} blocksize of
+ the input files is treated as an upper bound for input splits. A lower bound
+ on the split size can be set via
+
+ mapred.min.split.size.
+
+
Clearly, logical splits based on input-size is insufficient for many
+ applications since record boundaries are to respected. In such cases, the
+ application has to also implement a {@link RecordReader} on whom lies the
+ responsibilty to respect record-boundaries and present a record-oriented
+ view of the logical InputSplit to the individual task.
+
+ @see InputSplit
+ @see RecordReader
+ @see JobClient
+ @see FileInputFormat]]>
+
+
+
+
+
+
+
+
+
+ InputSplit.
+
+ @return the number of bytes in the input split.
+ @throws IOException]]>
+
+
+
+
+
+ InputSplit is
+ located as an array of Strings.
+ @throws IOException]]>
+
+
+
+ InputSplit represents the data to be processed by an
+ individual {@link Mapper}.
+
+
Typically, it presents a byte-oriented view on the input and is the
+ responsibility of {@link RecordReader} of the job to process this and present
+ a record-oriented view.
+
+ @see InputFormat
+ @see RecordReader]]>
+
+ Checking the input and output specifications of the job.
+
+
+ Computing the {@link InputSplit}s for the job.
+
+
+ Setup the requisite accounting information for the {@link DistributedCache}
+ of the job, if necessary.
+
+
+ Copying the job's jar and configuration to the map-reduce system directory
+ on the distributed file-system.
+
+
+ Submitting the job to the JobTracker and optionally monitoring
+ it's status.
+
+
+
+ Normally the user creates the application, describes various facets of the
+ job via {@link JobConf} and then uses the JobClient to submit
+ the job and monitor its progress.
+
+
Here is an example on how to use JobClient:
+
+ // Create a new JobConf
+ JobConf job = new JobConf(new Configuration(), MyJob.class);
+
+ // Specify various job-specific parameters
+ job.setJobName("myjob");
+
+ job.setInputPath(new Path("in"));
+ job.setOutputPath(new Path("out"));
+
+ job.setMapperClass(MyJob.MyMapper.class);
+ job.setReducerClass(MyJob.MyReducer.class);
+
+ // Submit the job, then poll for progress until the job is complete
+ JobClient.runJob(job);
+
+
+
Job Control
+
+
At times clients would chain map-reduce jobs to accomplish complex tasks
+ which cannot be done via a single map-reduce job. This is fairly easy since
+ the output of the job, typically, goes to distributed file-system and that
+ can be used as the input for the next job.
+
+
However, this also means that the onus on ensuring jobs are complete
+ (success/failure) lies squarely on the clients. In such situations the
+ various job-control options are:
+
+
+ {@link #runJob(JobConf)} : submits the job and returns only after
+ the job has completed.
+
+
+ {@link #submitJob(JobConf)} : only submits the job, then poll the
+ returned handle to the {@link RunningJob} to query status and make
+ scheduling decisions.
+
+
+ {@link JobConf#setJobEndNotificationURI(String)} : setup a notification
+ on job-completion, thus avoiding polling.
+
+
+
+ @see JobConf
+ @see ClusterStatus
+ @see Tool
+ @see DistributedCache]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ If the parameter {@code loadDefaults} is false, the new instance
+ will not load resources from the default files.
+
+ @param loadDefaults specifies whether to load from the default files]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if framework should keep the intermediate files
+ for failed tasks, false otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the outputs of the maps are to be compressed,
+ false otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This comparator should be provided if the equivalence rules for keys
+ for sorting the intermediates are different from those for grouping keys
+ before each call to
+ {@link Reducer#reduce(Object, java.util.Iterator, OutputCollector, Reporter)}.
+
+
For key-value pairs (K1,V1) and (K2,V2), the values (V1, V2) are passed
+ in a single call to the reduce function if K1 and K2 compare as equal.
+
+
Since {@link #setOutputKeyComparatorClass(Class)} can be used to control
+ how keys are sorted, this can be used in conjunction to simulate
+ secondary sort on values.
+
+
Note: This is not a guarantee of the reduce sort being
+ stable in any sense. (In any case, with the order of available
+ map-outputs to the reduce being non-deterministic, it wouldn't make
+ that much sense.)
+
+ @param theClass the comparator class to be used for grouping keys.
+ It should implement RawComparator.
+ @see #setOutputKeyComparatorClass(Class)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ combiner class used to combine map-outputs
+ before being sent to the reducers. Typically the combiner is same as the
+ the {@link Reducer} for the job i.e. {@link #getReducerClass()}.
+
+ @return the user-defined combiner class used to combine map-outputs.]]>
+
+
+
+
+
+ combiner class used to combine map-outputs
+ before being sent to the reducers.
+
+
The combiner is a task-level aggregation operation which, in some cases,
+ helps to cut down the amount of data transferred from the {@link Mapper} to
+ the {@link Reducer}, leading to better performance.
+
+
Typically the combiner is same as the the Reducer for the
+ job i.e. {@link #setReducerClass(Class)}.
+
+ @param theClass the user-defined combiner class used to combine
+ map-outputs.]]>
+
+
+
+
+ true.
+
+ @return true if speculative execution be used for this job,
+ false otherwise.]]>
+
+
+
+
+
+ true if speculative execution
+ should be turned on, else false.]]>
+
+
+
+
+ true.
+
+ @return true if speculative execution be
+ used for this job for map tasks,
+ false otherwise.]]>
+
+
+
+
+
+ true if speculative execution
+ should be turned on for map tasks,
+ else false.]]>
+
+
+
+
+ true.
+
+ @return true if speculative execution be used
+ for reduce tasks for this job,
+ false otherwise.]]>
+
+
+
+
+
+ true if speculative execution
+ should be turned on for reduce tasks,
+ else false.]]>
+
+
+
+
+ 1.
+
+ @return the number of reduce tasks for this job.]]>
+
+
+
+
+
+ Note: This is only a hint to the framework. The actual
+ number of spawned map tasks depends on the number of {@link InputSplit}s
+ generated by the job's {@link InputFormat#getSplits(JobConf, int)}.
+
+ A custom {@link InputFormat} is typically used to accurately control
+ the number of map tasks for the job.
+
+
How many maps?
+
+
The number of maps is usually driven by the total size of the inputs
+ i.e. total number of blocks of the input files.
+
+
The right level of parallelism for maps seems to be around 10-100 maps
+ per-node, although it has been set up to 300 or so for very cpu-light map
+ tasks. Task setup takes awhile, so it is best if the maps take at least a
+ minute to execute.
+
+
The default behavior of file-based {@link InputFormat}s is to split the
+ input into logical {@link InputSplit}s based on the total size, in
+ bytes, of input files. However, the {@link FileSystem} blocksize of the
+ input files is treated as an upper bound for input splits. A lower bound
+ on the split size can be set via
+
+ mapred.min.split.size.
+
+
Thus, if you expect 10TB of input data and have a blocksize of 128MB,
+ you'll end up with 82,000 maps, unless {@link #setNumMapTasks(int)} is
+ used to set it even higher.
+
+ @param n the number of map tasks for this job.
+ @see InputFormat#getSplits(JobConf, int)
+ @see FileInputFormat
+ @see FileSystem#getDefaultBlockSize()
+ @see FileStatus#getBlockSize()]]>
+
+
+
+
+ 1.
+
+ @return the number of reduce tasks for this job.]]>
+
+
+
+
+
+ How many reduces?
+
+
With 0.95 all of the reduces can launch immediately and
+ start transfering map outputs as the maps finish. With 1.75
+ the faster nodes will finish their first round of reduces and launch a
+ second wave of reduces doing a much better job of load balancing.
+
+
Increasing the number of reduces increases the framework overhead, but
+ increases load balancing and lowers the cost of failures.
+
+
The scaling factors above are slightly less than whole numbers to
+ reserve a few reduce slots in the framework for speculative-tasks, failures
+ etc.
+
+
Reducer NONE
+
+
It is legal to set the number of reduce-tasks to zero.
+
+
In this case the output of the map-tasks directly go to distributed
+ file-system, to the path set by
+ {@link FileOutputFormat#setOutputPath(JobConf, Path)}. Also, the
+ framework doesn't sort the map-outputs before writing it out to HDFS.
+
+ @param n the number of reduce tasks for this job.]]>
+
+
+
+
+ mapred.map.max.attempts
+ property. If this property is not already set, the default is 4 attempts.
+
+ @return the max number of attempts per map task.]]>
+
+
+
+
+
+
+
+
+
+
+ mapred.reduce.max.attempts
+ property. If this property is not already set, the default is 4 attempts.
+
+ @return the max number of attempts per reduce task.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ noFailures, the
+ tasktracker is blacklisted for this job.
+
+ @param noFailures maximum no. of failures of a given job per tasktracker.]]>
+
+
+
+
+ blacklisted for this job.
+
+ @return the maximum no. of failures of a given job per tasktracker.]]>
+
+
+
+
+ failed.
+
+ Defaults to zero, i.e. any failed map-task results in
+ the job being declared as {@link JobStatus#FAILED}.
+
+ @return the maximum percentage of map tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+
+ failed.
+
+ @param percent the maximum percentage of map tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+ failed.
+
+ Defaults to zero, i.e. any failed reduce-task results
+ in the job being declared as {@link JobStatus#FAILED}.
+
+ @return the maximum percentage of reduce tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+
+ failed.
+
+ @param percent the maximum percentage of reduce tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The debug script can aid debugging of failed map tasks. The script is
+ given task's stdout, stderr, syslog, jobconf files as arguments.
+
+
The debug command, run on the node where the map failed, is:
+
+ $script $stdout $stderr $syslog $jobconf.
+
+
+
The script file is distributed through {@link DistributedCache}
+ APIs. The script needs to be symlinked.
+
+ @param mDbgScript the script name]]>
+
+
+
+
+
+
+
+
+
+
+ The debug script can aid debugging of failed reduce tasks. The script
+ is given task's stdout, stderr, syslog, jobconf files as arguments.
+
+
The debug command, run on the node where the map failed, is:
+
+ $script $stdout $stderr $syslog $jobconf.
+
+
+
The script file is distributed through {@link DistributedCache}
+ APIs. The script file needs to be symlinked
+
+ @param rDbgScript the script name]]>
+
+
+
+
+
+
+
+
+
+ null if it hasn't
+ been set.
+ @see #setJobEndNotificationURI(String)]]>
+
+
+
+
+
+ The uri can contain 2 special parameters: $jobId and
+ $jobStatus. Those, if present, are replaced by the job's
+ identifier and completion-status respectively.
+
+
This is typically used by application-writers to implement chaining of
+ Map-Reduce jobs in an asynchronous manner.
+
+ @param uri the job end notification uri
+ @see JobStatus
+ @see Job Completion and Chaining]]>
+
+
+
+
+
+ When a job starts, a shared directory is created at location
+
+ ${mapred.local.dir}/taskTracker/jobcache/$jobid/work/ .
+ This directory is exposed to the users through
+ job.local.dir .
+ So, the tasks can use this space
+ as scratch space and share files among them.
+ This value is available as System property also.
+
+ @return The localized job specific shared directory]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ JobConf is the primary interface for a user to describe a
+ map-reduce job to the Hadoop framework for execution. The framework tries to
+ faithfully execute the job as-is described by JobConf, however:
+
+
+ Some configuration parameters might have been marked as
+
+ final by administrators and hence cannot be altered.
+
+
+ While some job parameters are straight-forward to set
+ (e.g. {@link #setNumReduceTasks(int)}), some parameters interact subtly
+ rest of the framework and/or job-configuration and is relatively more
+ complex for the user to control finely (e.g. {@link #setNumMapTasks(int)}).
+
+
+
+
JobConf typically specifies the {@link Mapper}, combiner
+ (if any), {@link Partitioner}, {@link Reducer}, {@link InputFormat} and
+ {@link OutputFormat} implementations to be used etc.
+
+
Optionally JobConf is used to specify other advanced facets
+ of the job such as Comparators to be used, files to be put in
+ the {@link DistributedCache}, whether or not intermediate and/or job outputs
+ are to be compressed (and how), debugability via user-provided scripts
+ ( {@link #setMapDebugScript(String)}/{@link #setReduceDebugScript(String)}),
+ for doing post-processing on task logs, task's stdout, stderr, syslog.
+ and etc.
+
+
Here is an example on how to configure a job via JobConf:
+
+ // Create a new JobConf
+ JobConf job = new JobConf(new Configuration(), MyJob.class);
+
+ // Specify various job-specific parameters
+ job.setJobName("myjob");
+
+ FileInputFormat.setInputPaths(job, new Path("in"));
+ FileOutputFormat.setOutputPath(job, new Path("out"));
+
+ job.setMapperClass(MyJob.MyMapper.class);
+ job.setCombinerClass(MyJob.MyReducer.class);
+ job.setReducerClass(MyJob.MyReducer.class);
+
+ job.setInputFormat(SequenceFileInputFormat.class);
+ job.setOutputFormat(SequenceFileOutputFormat.class);
+
+ @param jtIdentifier jobTracker identifier, or null
+ @param jobId job number, or null
+ @return a regex pattern matching JobIDs]]>
+
+
+
+
+ An example JobID is :
+ job_200707121733_0003 , which represents the third job
+ running at the jobtracker started at 200707121733.
+
+ Applications should never construct or parse JobID strings, but rather
+ use appropriate constructors or {@link #forName(String)} method.
+
+ @see TaskID
+ @see TaskAttemptID
+ @see JobTracker#getNewJobId()
+ @see JobTracker#getStartTime()]]>
+
Applications can use the {@link Reporter} provided to report progress
+ or just indicate that they are alive. In scenarios where the application
+ takes an insignificant amount of time to process individual key/value
+ pairs, this is crucial since the framework might assume that the task has
+ timed-out and kill that task. The other way of avoiding this is to set
+
+ mapred.task.timeout to a high-enough value (or even zero for no
+ time-outs).
+
+ @param key the input key.
+ @param value the input value.
+ @param output collects mapped keys and values.
+ @param reporter facility to report progress.]]>
+
+
+
+ Maps are the individual tasks which transform input records into a
+ intermediate records. The transformed intermediate records need not be of
+ the same type as the input records. A given input pair may map to zero or
+ many output pairs.
+
+
The Hadoop Map-Reduce framework spawns one map task for each
+ {@link InputSplit} generated by the {@link InputFormat} for the job.
+ Mapper implementations can access the {@link JobConf} for the
+ job via the {@link JobConfigurable#configure(JobConf)} and initialize
+ themselves. Similarly they can use the {@link Closeable#close()} method for
+ de-initialization.
+
+
The framework then calls
+ {@link #map(Object, Object, OutputCollector, Reporter)}
+ for each key/value pair in the InputSplit for that task.
+
+
All intermediate values associated with a given output key are
+ subsequently grouped by the framework, and passed to a {@link Reducer} to
+ determine the final output. Users can control the grouping by specifying
+ a Comparator via
+ {@link JobConf#setOutputKeyComparatorClass(Class)}.
+
+
The grouped Mapper outputs are partitioned per
+ Reducer. Users can control which keys (and hence records) go to
+ which Reducer by implementing a custom {@link Partitioner}.
+
+
Users can optionally specify a combiner, via
+ {@link JobConf#setCombinerClass(Class)}, to perform local aggregation of the
+ intermediate outputs, which helps to cut down the amount of data transferred
+ from the Mapper to the Reducer.
+
+
The intermediate, grouped outputs are always stored in
+ {@link SequenceFile}s. Applications can specify if and how the intermediate
+ outputs are to be compressed and which {@link CompressionCodec}s are to be
+ used via the JobConf.
+
+
If the job has
+ zero
+ reduces then the output of the Mapper is directly written
+ to the {@link FileSystem} without grouping by keys.
+
+
Example:
+
+ public class MyMapper<K extends WritableComparable, V extends Writable>
+ extends MapReduceBase implements Mapper<K, V, K, V> {
+
+ static enum MyCounters { NUM_RECORDS }
+
+ private String mapTaskId;
+ private String inputFile;
+ private int noRecords = 0;
+
+ public void configure(JobConf job) {
+ mapTaskId = job.get("mapred.task.id");
+ inputFile = job.get("mapred.input.file");
+ }
+
+ public void map(K key, V val,
+ OutputCollector<K, V> output, Reporter reporter)
+ throws IOException {
+ // Process the <key, value> pair (assume this takes a while)
+ // ...
+ // ...
+
+ // Let the framework know that we are alive, and kicking!
+ // reporter.progress();
+
+ // Process some more
+ // ...
+ // ...
+
+ // Increment the no. of <key, value> pairs processed
+ ++noRecords;
+
+ // Increment counters
+ reporter.incrCounter(NUM_RECORDS, 1);
+
+ // Every 100 records update application-level status
+ if ((noRecords%100) == 0) {
+ reporter.setStatus(mapTaskId + " processed " + noRecords +
+ " from input-file: " + inputFile);
+ }
+
+ // Output the result
+ output.collect(key, val);
+ }
+ }
+
+
+
Applications may write a custom {@link MapRunnable} to exert greater
+ control on map processing e.g. multi-threaded Mappers etc.
Mapping of input records to output records is complete when this method
+ returns.
+
+ @param input the {@link RecordReader} to read the input records.
+ @param output the {@link OutputCollector} to collect the outputrecords.
+ @param reporter {@link Reporter} to report progress, status-updates etc.
+ @throws IOException]]>
+
+
+
+ Custom implementations of MapRunnable can exert greater
+ control on map processing e.g. multi-threaded, asynchronous mappers etc.
+
+ @see Mapper]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ nearly
+ equal content length.
+ Subclasses implement {@link #getRecordReader(InputSplit, JobConf, Reporter)}
+ to construct RecordReader's for MultiFileSplit's.
+ @see MultiFileSplit]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ th Path]]>
+
+
+
+
+
+
+
+
+
+
+ th Path]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ MultiFileSplit can be used to implement {@link RecordReader}'s, with
+ reading one record per file.
+ @see FileSplit
+ @see MultiFileInputFormat]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ <key, value> pairs output by {@link Mapper}s
+ and {@link Reducer}s.
+
+
OutputCollector is the generalization of the facility
+ provided by the Map-Reduce framework to collect data output by either the
+ Mapper or the Reducer i.e. intermediate outputs
+ or the output of the job.
The Map-Reduce framework relies on the OutputCommitter of
+ the job to:
+
+
+ Setup the job during initialization. For example, create the temporary
+ output directory for the job during the initialization of the job.
+
+
+ Cleanup the job after the job completion. For example, remove the
+ temporary output directory after the job completion.
+
+
+ Setup the task temporary output.
+
+
+ Check whether a task needs a commit. This is to avoid the commit
+ procedure if a task does not need commit.
+
+
+ Commit of the task output.
+
+
+ Discard the task commit.
+
+
+
+ @see FileOutputCommitter
+ @see JobContext
+ @see TaskAttemptContext]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This is to validate the output specification for the job when it is
+ a job is submitted. Typically checks that it does not already exist,
+ throwing an exception when it already exists, so that output is not
+ overwritten.
+
+ @param ignored
+ @param job job configuration.
+ @throws IOException when output should not be attempted]]>
+
+
+
+ OutputFormat describes the output-specification for a
+ Map-Reduce job.
+
+
The Map-Reduce framework relies on the OutputFormat of the
+ job to:
+
+
+ Validate the output-specification of the job. For e.g. check that the
+ output directory doesn't already exist.
+
+ Provide the {@link RecordWriter} implementation to be used to write out
+ the output files of the job. Output files are stored in a
+ {@link FileSystem}.
+
+
+
+ @see RecordWriter
+ @see JobConf]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Typically a hash function on a all or a subset of the key.
+
+ @param key the key to be paritioned.
+ @param value the entry value.
+ @param numPartitions the total number of partitions.
+ @return the partition number for the key.]]>
+
+
+
+ Partitioner controls the partitioning of the keys of the
+ intermediate map-outputs. The key (or a subset of the key) is used to derive
+ the partition, typically by a hash function. The total number of partitions
+ is the same as the number of reduce tasks for the job. Hence this controls
+ which of the m reduce tasks the intermediate key (and hence the
+ record) is sent for reduction.
+
+ @see Reducer]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ 0.0 to 1.0.
+ @throws IOException]]>
+
+
+
+ RecordReader reads <key, value> pairs from an
+ {@link InputSplit}.
+
+
RecordReader, typically, converts the byte-oriented view of
+ the input, provided by the InputSplit, and presents a
+ record-oriented view for the {@link Mapper} & {@link Reducer} tasks for
+ processing. It thus assumes the responsibility of processing record
+ boundaries and presenting the tasks with keys and values.
RecordWriter implementations write the job outputs to the
+ {@link FileSystem}.
+
+ @see OutputFormat]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Reduces values for a given key.
+
+
The framework calls this method for each
+ <key, (list of values)> pair in the grouped inputs.
+ Output values must be of the same type as input values. Input keys must
+ not be altered. The framework will reuse the key and value objects
+ that are passed into the reduce, therefore the application should clone
+ the objects they want to keep a copy of. In many cases, all values are
+ combined into zero or one value.
+
+
+
Output pairs are collected with calls to
+ {@link OutputCollector#collect(Object,Object)}.
+
+
Applications can use the {@link Reporter} provided to report progress
+ or just indicate that they are alive. In scenarios where the application
+ takes an insignificant amount of time to process individual key/value
+ pairs, this is crucial since the framework might assume that the task has
+ timed-out and kill that task. The other way of avoiding this is to set
+
+ mapred.task.timeout to a high-enough value (or even zero for no
+ time-outs).
+
+ @param key the key.
+ @param values the list of values to reduce.
+ @param output to collect keys and combined values.
+ @param reporter facility to report progress.]]>
+
+
+
+ The number of Reducers for the job is set by the user via
+ {@link JobConf#setNumReduceTasks(int)}. Reducer implementations
+ can access the {@link JobConf} for the job via the
+ {@link JobConfigurable#configure(JobConf)} method and initialize themselves.
+ Similarly they can use the {@link Closeable#close()} method for
+ de-initialization.
+
+
Reducer has 3 primary phases:
+
+
+
+
Shuffle
+
+
Reducer is input the grouped output of a {@link Mapper}.
+ In the phase the framework, for each Reducer, fetches the
+ relevant partition of the output of all the Mappers, via HTTP.
+
+
+
+
+
Sort
+
+
The framework groups Reducer inputs by keys
+ (since different Mappers may have output the same key) in this
+ stage.
+
+
The shuffle and sort phases occur simultaneously i.e. while outputs are
+ being fetched they are merged.
+
+
SecondarySort
+
+
If equivalence rules for keys while grouping the intermediates are
+ different from those for grouping keys before reduction, then one may
+ specify a Comparator via
+ {@link JobConf#setOutputValueGroupingComparator(Class)}.Since
+ {@link JobConf#setOutputKeyComparatorClass(Class)} can be used to
+ control how intermediate keys are grouped, these can be used in conjunction
+ to simulate secondary sort on values.
+
+
+ For example, say that you want to find duplicate web pages and tag them
+ all with the url of the "best" known example. You would set up the job
+ like:
+
+
Map Input Key: url
+
Map Input Value: document
+
Map Output Key: document checksum, url pagerank
+
Map Output Value: url
+
Partitioner: by checksum
+
OutputKeyComparator: by checksum and then decreasing pagerank
+
OutputValueGroupingComparator: by checksum
+
+
+
+
+
Reduce
+
+
In this phase the
+ {@link #reduce(Object, Iterator, OutputCollector, Reporter)}
+ method is called for each <key, (list of values)> pair in
+ the grouped inputs.
+
The output of the reduce task is typically written to the
+ {@link FileSystem} via
+ {@link OutputCollector#collect(Object, Object)}.
+
+
+
+
The output of the Reducer is not re-sorted.
+
+
Example:
+
+ public class MyReducer<K extends WritableComparable, V extends Writable>
+ extends MapReduceBase implements Reducer<K, V, K, V> {
+
+ static enum MyCounters { NUM_RECORDS }
+
+ private String reduceTaskId;
+ private int noKeys = 0;
+
+ public void configure(JobConf job) {
+ reduceTaskId = job.get("mapred.task.id");
+ }
+
+ public void reduce(K key, Iterator<V> values,
+ OutputCollector<K, V> output,
+ Reporter reporter)
+ throws IOException {
+
+ // Process
+ int noValues = 0;
+ while (values.hasNext()) {
+ V value = values.next();
+
+ // Increment the no. of values for this key
+ ++noValues;
+
+ // Process the <key, value> pair (assume this takes a while)
+ // ...
+ // ...
+
+ // Let the framework know that we are alive, and kicking!
+ if ((noValues%10) == 0) {
+ reporter.progress();
+ }
+
+ // Process some more
+ // ...
+ // ...
+
+ // Output the <key, value>
+ output.collect(key, value);
+ }
+
+ // Increment the no. of <key, list of values> pairs processed
+ ++noKeys;
+
+ // Increment counters
+ reporter.incrCounter(NUM_RECORDS, 1);
+
+ // Every 100 keys update application-level status
+ if ((noKeys%100) == 0) {
+ reporter.setStatus(reduceTaskId + " processed " + noKeys);
+ }
+ }
+ }
+
+
+ @see Mapper
+ @see Partitioner
+ @see Reporter
+ @see MapReduceBase]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Counter of the given group/name.]]>
+
+
+
+
+
+
+ Enum.
+ @param amount A non-negative amount by which the counter is to
+ be incremented.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ InputSplit that the map is reading from.
+ @throws UnsupportedOperationException if called outside a mapper]]>
+
+
+
+
+
+
+
+
+ {@link Mapper} and {@link Reducer} can use the Reporter
+ provided to report progress or just indicate that they are alive. In
+ scenarios where the application takes an insignificant amount of time to
+ process individual key/value pairs, this is crucial since the framework
+ might assume that the task has timed-out and kill that task.
+
+
Applications can also update {@link Counters} via the provided
+ Reporter .
+
+ @see Progressable
+ @see Counters]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ progress of the job's map-tasks, as a float between 0.0
+ and 1.0. When all map tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's map-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ progress of the job's reduce-tasks, as a float between 0.0
+ and 1.0. When all reduce tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's reduce-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ progress of the job's cleanup-tasks, as a float between 0.0
+ and 1.0. When all cleanup tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's cleanup-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ progress of the job's setup-tasks, as a float between 0.0
+ and 1.0. When all setup tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's setup-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ true if the job is complete, else false.
+ @throws IOException]]>
+
+
+
+
+
+ true if the job succeeded, else false.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ RunningJob is the user-interface to query for details on a
+ running Map-Reduce job.
+
+
Clients can get hold of RunningJob via the {@link JobClient}
+ and then query the running-job for details such as name, configuration,
+ progress etc.
+
+ @see JobClient]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This allows the user to specify the key class to be different
+ from the actual class ({@link BytesWritable}) used for writing
+
+ @param conf the {@link JobConf} to modify
+ @param theClass the SequenceFile output key class.]]>
+
+
+
+
+
+
+ This allows the user to specify the value class to be different
+ from the actual class ({@link BytesWritable}) used for writing
+
+ @param conf the {@link JobConf} to modify
+ @param theClass the SequenceFile output key class.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ f. The filtering criteria is
+ MD5(key) % f == 0.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ f using
+ the criteria record# % f == 0.
+ For example, if the frequency is 10, one out of 10 records is returned.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if auto increment
+ {@link SkipBadRecords#COUNTER_MAP_PROCESSED_RECORDS}.
+ false otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ true if auto increment
+ {@link SkipBadRecords#COUNTER_REDUCE_PROCESSED_GROUPS}.
+ false otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Hadoop provides an optional mode of execution in which the bad records
+ are detected and skipped in further attempts.
+
+
This feature can be used when map/reduce tasks crashes deterministically on
+ certain input. This happens due to bugs in the map/reduce function. The usual
+ course would be to fix these bugs. But sometimes this is not possible;
+ perhaps the bug is in third party libraries for which the source code is
+ not available. Due to this, the task never reaches to completion even with
+ multiple attempts and complete data for that task is lost.
+
+
With this feature, only a small portion of data is lost surrounding
+ the bad record, which may be acceptable for some user applications.
+ see {@link SkipBadRecords#setMapperMaxSkipRecords(Configuration, long)}
+
+
The skipping mode gets kicked off after certain no of failures
+ see {@link SkipBadRecords#setAttemptsToStartSkipping(Configuration, int)}
+
+
In the skipping mode, the map/reduce task maintains the record range which
+ is getting processed at all times. Before giving the input to the
+ map/reduce function, it sends this record range to the Task tracker.
+ If task crashes, the Task tracker knows which one was the last reported
+ range. On further attempts that range get skipped.
+ @param jtIdentifier jobTracker identifier, or null
+ @param jobId job number, or null
+ @param isMap whether the tip is a map, or null
+ @param taskId taskId number, or null
+ @param attemptId the task attempt number, or null
+ @return a regex pattern matching TaskAttemptIDs]]>
+
+
+
+
+ An example TaskAttemptID is :
+ attempt_200707121733_0003_m_000005_0 , which represents the
+ zeroth task attempt for the fifth map task in the third job
+ running at the jobtracker started at 200707121733.
+
+ Applications should never construct or parse TaskAttemptID strings
+ , but rather use appropriate constructors or {@link #forName(String)}
+ method.
+
+ @see JobID
+ @see TaskID]]>
+
+ @param jtIdentifier jobTracker identifier, or null
+ @param jobId job number, or null
+ @param isMap whether the tip is a map, or null
+ @param taskId taskId number, or null
+ @return a regex pattern matching TaskIDs]]>
+
+
+
+
+ An example TaskID is :
+ task_200707121733_0003_m_000005 , which represents the
+ fifth map task in the third job running at the jobtracker
+ started at 200707121733.
+
+ Applications should never construct or parse TaskID strings
+ , but rather use appropriate constructors or {@link #forName(String)}
+ method.
+
+ @see JobID
+ @see TaskAttemptID]]>
+
+
+
+
+
+
+
+ (tbl(,),tbl(,),...,tbl(,)) }]]>
+
+
+
+
+
+
+
+ (tbl(,),tbl(,),...,tbl(,)) }]]>
+
+
+
+ mapred.join.define.<ident> to a classname. In the expression
+ mapred.join.expr, the identifier will be assumed to be a
+ ComposableRecordReader.
+ mapred.join.keycomparator can be a classname used to compare keys
+ in the join.
+ @see JoinRecordReader
+ @see MultiFilterRecordReader]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ......
+ }]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ capacity children to position
+ id in the parent reader.
+ The id of a root CompositeRecordReader is -1 by convention, but relying
+ on this is not recommended.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ override(S1,S2,S3) will prefer values
+ from S3 over S2, and values from S2 over S1 for all keys
+ emitted from all sources.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ [,,...,]]]>
+
+
+
+
+
+
+ out.
+ TupleWritable format:
+ {@code
+ ......
+ }]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ It has to be specified how key and values are passed from one element of
+ the chain to the next, by value or by reference. If a Mapper leverages the
+ assumed semantics that the key and values are not modified by the collector
+ 'by value' must be used. If the Mapper does not expect this semantics, as
+ an optimization to avoid serialization and deserialization 'by reference'
+ can be used.
+
+ For the added Mapper the configuration given for it,
+ mapperConf, have precedence over the job's JobConf. This
+ precedence is in effect when the task is running.
+
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainMapper, this is done by the addMapper for the last mapper in the chain
+
+
+ @param job job's JobConf to add the Mapper class.
+ @param klass the Mapper class to add.
+ @param inputKeyClass mapper input key class.
+ @param inputValueClass mapper input value class.
+ @param outputKeyClass mapper output key class.
+ @param outputValueClass mapper output value class.
+ @param byValue indicates if key/values should be passed by value
+ to the next Mapper in the chain, if any.
+ @param mapperConf a JobConf with the configuration for the Mapper
+ class. It is recommended to use a JobConf without default values using the
+ JobConf(boolean loadDefaults) constructor with FALSE.]]>
+
+
+
+
+
+
+ If this method is overriden super.configure(...) should be
+ invoked at the beginning of the overwriter method.]]>
+
+
+
+
+
+
+
+
+
+ map(...) methods of the Mappers in the chain.]]>
+
+
+
+
+
+
+ If this method is overriden super.close() should be
+ invoked at the end of the overwriter method.]]>
+
+
+
+
+ The Mapper classes are invoked in a chained (or piped) fashion, the output of
+ the first becomes the input of the second, and so on until the last Mapper,
+ the output of the last Mapper will be written to the task's output.
+
+ The key functionality of this feature is that the Mappers in the chain do not
+ need to be aware that they are executed in a chain. This enables having
+ reusable specialized Mappers that can be combined to perform composite
+ operations within a single task.
+
+ Special care has to be taken when creating chains that the key/values output
+ by a Mapper are valid for the following Mapper in the chain. It is assumed
+ all Mappers and the Reduce in the chain use maching output and input key and
+ value classes as no conversion is done by the chaining code.
+
+ Using the ChainMapper and the ChainReducer classes is possible to compose
+ Map/Reduce jobs that look like [MAP+ / REDUCE MAP*]. And
+ immediate benefit of this pattern is a dramatic reduction in disk IO.
+
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainMapper, this is done by the addMapper for the last mapper in the chain.
+
+ ChainMapper usage pattern:
+
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ It has to be specified how key and values are passed from one element of
+ the chain to the next, by value or by reference. If a Reducer leverages the
+ assumed semantics that the key and values are not modified by the collector
+ 'by value' must be used. If the Reducer does not expect this semantics, as
+ an optimization to avoid serialization and deserialization 'by reference'
+ can be used.
+
+ For the added Reducer the configuration given for it,
+ reducerConf, have precedence over the job's JobConf. This
+ precedence is in effect when the task is running.
+
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainReducer, this is done by the setReducer or the addMapper for the last
+ element in the chain.
+
+ @param job job's JobConf to add the Reducer class.
+ @param klass the Reducer class to add.
+ @param inputKeyClass reducer input key class.
+ @param inputValueClass reducer input value class.
+ @param outputKeyClass reducer output key class.
+ @param outputValueClass reducer output value class.
+ @param byValue indicates if key/values should be passed by value
+ to the next Mapper in the chain, if any.
+ @param reducerConf a JobConf with the configuration for the Reducer
+ class. It is recommended to use a JobConf without default values using the
+ JobConf(boolean loadDefaults) constructor with FALSE.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ It has to be specified how key and values are passed from one element of
+ the chain to the next, by value or by reference. If a Mapper leverages the
+ assumed semantics that the key and values are not modified by the collector
+ 'by value' must be used. If the Mapper does not expect this semantics, as
+ an optimization to avoid serialization and deserialization 'by reference'
+ can be used.
+
+ For the added Mapper the configuration given for it,
+ mapperConf, have precedence over the job's JobConf. This
+ precedence is in effect when the task is running.
+
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainMapper, this is done by the addMapper for the last mapper in the chain
+ .
+
+ @param job chain job's JobConf to add the Mapper class.
+ @param klass the Mapper class to add.
+ @param inputKeyClass mapper input key class.
+ @param inputValueClass mapper input value class.
+ @param outputKeyClass mapper output key class.
+ @param outputValueClass mapper output value class.
+ @param byValue indicates if key/values should be passed by value
+ to the next Mapper in the chain, if any.
+ @param mapperConf a JobConf with the configuration for the Mapper
+ class. It is recommended to use a JobConf without default values using the
+ JobConf(boolean loadDefaults) constructor with FALSE.]]>
+
+
+
+
+
+
+ If this method is overriden super.configure(...) should be
+ invoked at the beginning of the overwriter method.]]>
+
+
+
+
+
+
+
+
+
+ reduce(...) method of the Reducer with the
+ map(...) methods of the Mappers in the chain.]]>
+
+
+
+
+
+
+ If this method is overriden super.close() should be
+ invoked at the end of the overwriter method.]]>
+
+
+
+
+ For each record output by the Reducer, the Mapper classes are invoked in a
+ chained (or piped) fashion, the output of the first becomes the input of the
+ second, and so on until the last Mapper, the output of the last Mapper will
+ be written to the task's output.
+
+ The key functionality of this feature is that the Mappers in the chain do not
+ need to be aware that they are executed after the Reducer or in a chain.
+ This enables having reusable specialized Mappers that can be combined to
+ perform composite operations within a single task.
+
+ Special care has to be taken when creating chains that the key/values output
+ by a Mapper are valid for the following Mapper in the chain. It is assumed
+ all Mappers and the Reduce in the chain use maching output and input key and
+ value classes as no conversion is done by the chaining code.
+
+ Using the ChainMapper and the ChainReducer classes is possible to compose
+ Map/Reduce jobs that look like [MAP+ / REDUCE MAP*]. And
+ immediate benefit of this pattern is a dramatic reduction in disk IO.
+
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainReducer, this is done by the setReducer or the addMapper for the last
+ element in the chain.
+
+ ChainReducer usage pattern:
+
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ all splits.
+ @param freq The frequency with which records will be emitted.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ all splits.
+ This will read every split at the client, which is very expensive.
+ @param freq Probability with which a key will be chosen.
+ @param numSamples Total number of samples to obtain from all selected
+ splits.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ all splits.
+ Takes the first numSamples / numSplits records from each split.
+ @param numSamples Total number of samples to obtain from all selected
+ splits.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the name output is multi, false
+ if it is single. If the name output is not defined it returns
+ false]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ @param conf job conf to add the named output
+ @param namedOutput named output name, it has to be a word, letters
+ and numbers only, cannot be the word 'part' as
+ that is reserved for the
+ default output.
+ @param outputFormatClass OutputFormat class.
+ @param keyClass key class
+ @param valueClass value class]]>
+
+
+
+
+
+
+
+
+
+
+
+ @param conf job conf to add the named output
+ @param namedOutput named output name, it has to be a word, letters
+ and numbers only, cannot be the word 'part' as
+ that is reserved for the
+ default output.
+ @param outputFormatClass OutputFormat class.
+ @param keyClass key class
+ @param valueClass value class]]>
+
+
+
+
+
+
+
+ By default these counters are disabled.
+
+ MultipleOutputs supports counters, by default the are disabled.
+ The counters group is the {@link MultipleOutputs} class name.
+
+ The names of the counters are the same as the named outputs. For multi
+ named outputs the name of the counter is the concatenation of the named
+ output, and underscore '_' and the multiname.
+
+ @param conf job conf to enableadd the named output.
+ @param enabled indicates if the counters will be enabled or not.]]>
+
+
+
+
+
+
+ By default these counters are disabled.
+
+ MultipleOutputs supports counters, by default the are disabled.
+ The counters group is the {@link MultipleOutputs} class name.
+
+ The names of the counters are the same as the named outputs. For multi
+ named outputs the name of the counter is the concatenation of the named
+ output, and underscore '_' and the multiname.
+
+
+ @param conf job conf to enableadd the named output.
+ @return TRUE if the counters are enabled, FALSE if they are disabled.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ @param namedOutput the named output name
+ @param reporter the reporter
+ @return the output collector for the given named output
+ @throws IOException thrown if output collector could not be created]]>
+
+
+
+
+
+
+
+
+
+
+ @param namedOutput the named output name
+ @param multiName the multi name part
+ @param reporter the reporter
+ @return the output collector for the given named output
+ @throws IOException thrown if output collector could not be created]]>
+
+
+
+
+
+
+ If overriden subclasses must invoke super.close() at the
+ end of their close()
+
+ @throws java.io.IOException thrown if any of the MultipleOutput files
+ could not be closed properly.]]>
+
+
+
+ OutputCollector passed to
+ the map() and reduce() methods of the
+ Mapper and Reducer implementations.
+
+ Each additional output, or named output, may be configured with its own
+ OutputFormat, with its own key class and with its own value
+ class.
+
+ A named output can be a single file or a multi file. The later is refered as
+ a multi named output.
+
+ A multi named output is an unbound set of files all sharing the same
+ OutputFormat, key class and value class configuration.
+
+ When named outputs are used within a Mapper implementation,
+ key/values written to a name output are not part of the reduce phase, only
+ key/values written to the job OutputCollector are part of the
+ reduce phase.
+
+ MultipleOutputs supports counters, by default the are disabled. The counters
+ group is the {@link MultipleOutputs} class name.
+
+ The names of the counters are the same as the named outputs. For multi
+ named outputs the name of the counter is the concatenation of the named
+ output, and underscore '_' and the multiname.
+
+ Job configuration usage pattern is:
+
+
+ JobConf conf = new JobConf();
+
+ conf.setInputPath(inDir);
+ FileOutputFormat.setOutputPath(conf, outDir);
+
+ conf.setMapperClass(MOMap.class);
+ conf.setReducerClass(MOReduce.class);
+ ...
+
+ // Defines additional single text based output 'text' for the job
+ MultipleOutputs.addNamedOutput(conf, "text", TextOutputFormat.class,
+ LongWritable.class, Text.class);
+
+ // Defines additional multi sequencefile based output 'sequence' for the
+ // job
+ MultipleOutputs.addMultiNamedOutput(conf, "seq",
+ SequenceFileOutputFormat.class,
+ LongWritable.class, Text.class);
+ ...
+
+ JobClient jc = new JobClient();
+ RunningJob job = jc.submitJob(conf);
+
+ ...
+
+
+ Job configuration usage pattern is:
+
+
+ public class MOReduce implements
+ Reducer<WritableComparable, Writable> {
+ private MultipleOutputs mos;
+
+ public void configure(JobConf conf) {
+ ...
+ mos = new MultipleOutputs(conf);
+ }
+
+ public void reduce(WritableComparable key, Iterator<Writable> values,
+ OutputCollector output, Reporter reporter)
+ throws IOException {
+ ...
+ mos.getCollector("text", reporter).collect(key, new Text("Hello"));
+ mos.getCollector("seq", "A", reporter).collect(key, new Text("Bye"));
+ mos.getCollector("seq", "B", reporter).collect(key, new Text("Chau"));
+ ...
+ }
+
+ public void close() throws IOException {
+ mos.close();
+ ...
+ }
+
+ }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ It can be used instead of the default implementation,
+ @link org.apache.hadoop.mapred.MapRunner, when the Map operation is not CPU
+ bound in order to improve throughput.
+
+ Map implementations using this MapRunnable must be thread-safe.
+
+ The Map-Reduce job has to be configured to use this MapRunnable class (using
+ the JobConf.setMapRunnerClass method) and
+ the number of thread the thread-pool can use with the
+ mapred.map.multithreadedrunner.threads property, its default
+ value is 10 threads.
+
+ Alternatively, the properties can be set in the configuration with proper
+ values.
+
+ @see DBConfiguration#configureDB(JobConf, String, String, String, String)
+ @see DBInputFormat#setInput(JobConf, Class, String, String)
+ @see DBInputFormat#setInput(JobConf, Class, String, String, String, String...)
+ @see DBOutputFormat#setOutput(JobConf, String, String...)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ 20070101 AND length > 0)'
+ @param orderBy the fieldNames in the orderBy clause.
+ @param fieldNames The field names in the table
+ @see #setInput(JobConf, Class, String, String)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ DBInputFormat emits LongWritables containing the record number as
+ key and DBWritables as value.
+
+ The SQL query, and input class can be using one of the two
+ setInput methods.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ {@link DBOutputFormat} accepts <key,value> pairs, where
+ key has a type extending DBWritable. Returned {@link RecordWriter}
+ writes only the key to the database with a batch SQL query.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ DBWritable. DBWritable, is similar to {@link Writable}
+ except that the {@link #write(PreparedStatement)} method takes a
+ {@link PreparedStatement}, and {@link #readFields(ResultSet)}
+ takes a {@link ResultSet}.
+
+ Implementations are responsible for writing the fields of the object
+ to PreparedStatement, and reading the fields of the object from the
+ ResultSet.
+
+
Example:
+ If we have the following table in the database :
+
+ CREATE TABLE MyTable (
+ counter INTEGER NOT NULL,
+ timestamp BIGINT NOT NULL,
+ );
+
+ then we can read/write the tuples from/to the table with :
+
+ public class MyWritable implements Writable, DBWritable {
+ // Some data
+ private int counter;
+ private long timestamp;
+
+ //Writable#write() implementation
+ public void write(DataOutput out) throws IOException {
+ out.writeInt(counter);
+ out.writeLong(timestamp);
+ }
+
+ //Writable#readFields() implementation
+ public void readFields(DataInput in) throws IOException {
+ counter = in.readInt();
+ timestamp = in.readLong();
+ }
+
+ public void write(PreparedStatement statement) throws SQLException {
+ statement.setInt(1, counter);
+ statement.setLong(2, timestamp);
+ }
+
+ public void readFields(ResultSet resultSet) throws SQLException {
+ counter = resultSet.getInt(1);
+ timestamp = resultSet.getLong(2);
+ }
+ }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ When constructing the instance, if the factory property
+ contextName.class exists,
+ its value is taken to be the name of the class to instantiate. Otherwise,
+ the default is to create an instance of
+ org.apache.hadoop.metrics.spi.NullContext, which is a
+ dummy "no-op" context which will cause all metric data to be discarded.
+
+ @param contextName the name of the context
+ @return the named MetricsContext]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ When the instance is constructed, this method checks if the file
+ hadoop-metrics.properties exists on the class path. If it
+ exists, it must be in the format defined by java.util.Properties, and all
+ the properties in the file are set as attributes on the newly created
+ ContextFactory instance.
+
+ @return the singleton ContextFactory instance]]>
+
+
+
+ getFactory() method.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ startMonitoring() again after calling
+ this.
+ @see #close()]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ recordName.
+ Throws an exception if the metrics implementation is configured with a fixed
+ set of record names and recordName is not in that set.
+
+ @param recordName the name of the record
+ @throws MetricsException if recordName conflicts with configuration data]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A record name identifies the kind of data to be reported. For example, a
+ program reporting statistics relating to the disks on a computer might use
+ a record name "diskStats".
+
+ A record has zero or more tags. A tag has a name and a value. To
+ continue the example, the "diskStats" record might use a tag named
+ "diskName" to identify a particular disk. Sometimes it is useful to have
+ more than one tag, so there might also be a "diskType" with value "ide" or
+ "scsi" or whatever.
+
+ A record also has zero or more metrics. These are the named
+ values that are to be reported to the metrics system. In the "diskStats"
+ example, possible metric names would be "diskPercentFull", "diskPercentBusy",
+ "kbReadPerSecond", etc.
+
+ The general procedure for using a MetricsRecord is to fill in its tag and
+ metric values, and then call update() to pass the record to the
+ client library.
+ Metric data is not immediately sent to the metrics system
+ each time that update() is called.
+ An internal table is maintained, identified by the record name. This
+ table has columns
+ corresponding to the tag and the metric names, and rows
+ corresponding to each unique set of tag values. An update
+ either modifies an existing row in the table, or adds a new row with a set of
+ tag values that are different from all the other rows. Note that if there
+ are no tags, then there can be at most one row in the table.
+
+ Once a row is added to the table, its data will be sent to the metrics system
+ on every timer period, whether or not it has been updated since the previous
+ timer period. If this is inappropriate, for example if metrics were being
+ reported by some transient object in an application, the remove()
+ method can be used to remove the row and thus stop the data from being
+ sent.
+
+ Note that the update() method is atomic. This means that it is
+ safe for different threads to be updating the same metric. More precisely,
+ it is OK for different threads to call update() on MetricsRecord instances
+ with the same set of tag names and tag values. Different threads should
+ not use the same MetricsRecord instance at the same time.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ MetricsContext.registerUpdater().]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ fileName attribute,
+ if specified. Otherwise the data will be written to standard
+ output.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class is configured by setting ContextFactory attributes which in turn
+ are usually configured through a properties file. All the attributes are
+ prefixed by the contextName. For example, the properties file might contain:
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ contextName.tableName. The returned map consists of
+ those attributes with the contextName and tableName stripped off.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ recordName.
+ Throws an exception if the metrics implementation is configured with a fixed
+ set of record names and recordName is not in that set.
+
+ @param recordName the name of the record
+ @throws MetricsException if recordName conflicts with configuration data]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class implements the internal table of metric data, and the timer
+ on which data is to be sent to the metrics system. Subclasses must
+ override the abstract emitRecord method in order to transmit
+ the data. ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ update
+ and remove().]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ hostname or hostname:port. If
+ the specs string is null, defaults to localhost:defaultPort.
+
+ @return a list of InetSocketAddress objects.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ,name="
+ Where the and are the supplied parameters
+
+ @param serviceName
+ @param nameName
+ @param theMbean - the MBean to register
+ @return the named used to register the MBean]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ hadoop.rpc.socket.factory.class.<ClassName>. When no
+ such parameter exists then fall back on the default socket factory as
+ configured by hadoop.rpc.socket.factory.class.default. If
+ this default socket factory is not configured, then fall back on the JVM
+ default socket factory.
+
+ @param conf the configuration
+ @param clazz the class (usually a {@link VersionedProtocol})
+ @return a socket factory]]>
+
+
+
+
+
+ hadoop.rpc.socket.factory.default
+
+ @param conf the configuration
+ @return the default socket factory as specified in the configuration or
+ the JVM default socket factory if the configuration does not
+ contain a default socket factory property.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ :
+ ://:/]]>
+
+
+
+
+
+
+
+ :
+ ://:/]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ From documentation for {@link #getInputStream(Socket, long)}:
+ Returns InputStream for the socket. If the socket has an associated
+ SocketChannel then it returns a
+ {@link SocketInputStream} with the given timeout. If the socket does not
+ have a channel, {@link Socket#getInputStream()} is returned. In the later
+ case, the timeout argument is ignored and the timeout set with
+ {@link Socket#setSoTimeout(int)} applies for reads.
+
+ Any socket created using socket factories returned by {@link #NetUtils},
+ must use this interface instead of {@link Socket#getInputStream()}.
+
+ @see #getInputStream(Socket, long)
+
+ @param socket
+ @return InputStream for reading from the socket.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+ Any socket created using socket factories returned by {@link #NetUtils},
+ must use this interface instead of {@link Socket#getInputStream()}.
+
+ @see Socket#getChannel()
+
+ @param socket
+ @param timeout timeout in milliseconds. This may not always apply. zero
+ for waiting as long as necessary.
+ @return InputStream for reading from the socket.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+ From documentation for {@link #getOutputStream(Socket, long)} :
+ Returns OutputStream for the socket. If the socket has an associated
+ SocketChannel then it returns a
+ {@link SocketOutputStream} with the given timeout. If the socket does not
+ have a channel, {@link Socket#getOutputStream()} is returned. In the later
+ case, the timeout argument is ignored and the write will wait until
+ data is available.
The task requires the file or the nested fileset element to be
+ specified. Optional attributes are language (set the output
+ language, default is "java"),
+ destdir (name of the destination directory for generated java/c++
+ code, default is ".") and failonerror (specifies error handling
+ behavior. default is true).
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ in]]>
+
+
+
+
+
+
+ out.]]>
+
+
+
+
+
+
+
+
+
+ reset is true, then resets the checksum.
+ @return number of bytes written. Will be equal to getChecksumSize();]]>
+
+
+
+
+
+
+
+
+ reset is true, then resets the checksum.
+ @return number of bytes written. Will be equal to getChecksumSize();]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ GenericOptionsParser to parse only the generic Hadoop
+ arguments.
+
+ The array of string arguments other than the generic arguments can be
+ obtained by {@link #getRemainingArgs()}.
+
+ @param conf the Configuration to modify.
+ @param args command-line arguments.]]>
+
+
+
+
+ GenericOptionsParser to parse given options as well
+ as generic Hadoop options.
+
+ The resulting CommandLine object can be obtained by
+ {@link #getCommandLine()}.
+
+ @param conf the configuration to modify
+ @param options options built by the caller
+ @param args User-specified arguments]]>
+
+
+
+
+ Strings containing the un-parsed arguments
+ or empty array if commandLine was not defined.]]>
+
+
+
+
+ CommandLine object
+ to process the parsed arguments.
+
+ Note: If the object is created with
+ {@link #GenericOptionsParser(Configuration, String[])}, then returned
+ object will only contain parsed generic options.
+
+ @return CommandLine representing list of arguments
+ parsed against Options descriptor.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ GenericOptionsParser is a utility to parse command line
+ arguments generic to the Hadoop framework.
+
+ GenericOptionsParser recognizes several standarad command
+ line arguments, enabling applications to easily specify a namenode, a
+ jobtracker, additional configuration resources etc.
+
+
Generic Options
+
+
The supported generic options are:
+
+ -conf <configuration file> specify a configuration file
+ -D <property=value> use value for given property
+ -fs <local|namenode:port> specify a namenode
+ -jt <local|jobtracker:port> specify a job tracker
+ -files <comma separated list of files> specify comma separated
+ files to be copied to the map reduce cluster
+ -libjars <comma separated list of jars> specify comma separated
+ jar files to include in the classpath.
+ -archives <comma separated list of archives> specify comma
+ separated archives to be unarchived on the compute machines.
+
+
Generic command line arguments might modify
+ Configuration objects, given to constructors.
+
+
The functionality is implemented using Commons CLI.
+
+
Examples:
+
+ $ bin/hadoop dfs -fs darwin:8020 -ls /data
+ list /data directory in dfs with namenode darwin:8020
+
+ $ bin/hadoop dfs -D fs.default.name=darwin:8020 -ls /data
+ list /data directory in dfs with namenode darwin:8020
+
+ $ bin/hadoop dfs -conf hadoop-site.xml -ls /data
+ list /data directory in dfs with conf specified in hadoop-site.xml
+
+ $ bin/hadoop job -D mapred.job.tracker=darwin:50020 -submit job.xml
+ submit a job to job tracker darwin:50020
+
+ $ bin/hadoop job -jt darwin:50020 -submit job.xml
+ submit a job to job tracker darwin:50020
+
+ $ bin/hadoop job -jt local -submit job.xml
+ submit a job to local runner
+
+ $ bin/hadoop jar -libjars testlib.jar
+ -archives test.tgz -files file.txt inputjar args
+ job submission with libjars, files and archives
+
+
+ @see Tool
+ @see ToolRunner]]>
+
+
+
+
+
+
+
+
+
+
+ Class<T>) of the
+ argument of type T.
+ @param The type of the argument
+ @param t the object to get it class
+ @return Class<T>]]>
+
+
+
+
+
+
+ List<T> to a an array of
+ T[].
+ @param c the Class object of the items in the list
+ @param list the list to convert]]>
+
+
+
+
+
+ List<T> to a an array of
+ T[].
+ @param list the list to convert
+ @throws ArrayIndexOutOfBoundsException if the list is empty.
+ Use {@link #toArray(Class, List)} if the list may be empty.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ io.file.buffer.size specified in the given
+ Configuration.
+ @param in input stream
+ @param conf configuration
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if native-hadoop is loaded,
+ else false]]>
+
+
+
+
+
+ true if native hadoop libraries, if present, can be
+ used for this job; false otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ { pq.top().change(); pq.adjustTop(); }
+ instead of
+ { o = pq.pop(); o.change(); pq.push(o); }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Clients and/or applications can use the provided Progressable
+ to explicitly report progress to the Hadoop framework. This is especially
+ important for operations which take an insignificant amount of time since,
+ in-lieu of the reported progress, the framework has to assume that an error
+ has occured and time-out the operation.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Class is to be obtained
+ @return the correctly typed Class of the given object.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Hadoop Pipes
+ or Hadoop Streaming.
+
+ It also checks to ensure that we are running on a *nix platform else
+ (e.g. in Cygwin/Windows) it returns null.
+ @param conf configuration
+ @return a String[] with the ulimit command arguments or
+ null if we are running on a non *nix platform or
+ if the limit is unspecified.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Shell interface.
+ @param cmd shell command to execute.
+ @return the output of the executed command.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Shell can be used to run unix commands like du or
+ df. It also offers facilities to gate commands by
+ time-intervals.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ShellCommandExecutorshould be used in cases where the output
+ of the command needs no explicit parsing and where the command, working
+ directory and the environment remains unchanged. The output of the command
+ is stored as-is and is expected to be small.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ArrayList of string values]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ charToEscape in the string
+ with the escape char escapeChar
+
+ @param str string
+ @param escapeChar escape char
+ @param charToEscape the char to be escaped
+ @return an escaped string]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ charToEscape in the string
+ with the escape char escapeChar
+
+ @param str string
+ @param escapeChar escape char
+ @param charToEscape the escaped char
+ @return an unescaped string]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Tool, is the standard for any Map-Reduce tool/application.
+ The tool/application should delegate the handling of
+
+ standard command-line options to {@link ToolRunner#run(Tool, String[])}
+ and only handle its custom arguments.
+
+
Here is how a typical Tool is implemented:
+
+ public class MyApp extends Configured implements Tool {
+
+ public int run(String[] args) throws Exception {
+ // Configuration processed by ToolRunner
+ Configuration conf = getConf();
+
+ // Create a JobConf using the processed conf
+ JobConf job = new JobConf(conf, MyApp.class);
+
+ // Process custom command-line options
+ Path in = new Path(args[1]);
+ Path out = new Path(args[2]);
+
+ // Specify various job-specific parameters
+ job.setJobName("my-app");
+ job.setInputPath(in);
+ job.setOutputPath(out);
+ job.setMapperClass(MyApp.MyMapper.class);
+ job.setReducerClass(MyApp.MyReducer.class);
+
+ // Submit the job, then poll for progress until the job is complete
+ JobClient.runJob(job);
+ }
+
+ public static void main(String[] args) throws Exception {
+ // Let ToolRunner handle generic command-line options
+ int res = ToolRunner.run(new Configuration(), new Sort(), args);
+
+ System.exit(res);
+ }
+ }
+
+
+ @see GenericOptionsParser
+ @see ToolRunner]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Tool by {@link Tool#run(String[])}, after
+ parsing with the given generic arguments. Uses the given
+ Configuration, or builds one if null.
+
+ Sets the Tool's configuration with the possibly modified
+ version of the conf.
+
+ @param conf Configuration for the Tool.
+ @param tool Tool to run.
+ @param args command-line arguments to the tool.
+ @return exit code of the {@link Tool#run(String[])} method.]]>
+
+
+
+
+
+
+
+ Tool with its Configuration.
+
+ Equivalent to run(tool.getConf(), tool, args).
+
+ @param tool Tool to run.
+ @param args command-line arguments to the tool.
+ @return exit code of the {@link Tool#run(String[])} method.]]>
+
+
+
+
+
+
+
+
+
+ ToolRunner can be used to run classes implementing
+ Tool interface. It works in conjunction with
+ {@link GenericOptionsParser} to parse the
+
+ generic hadoop command line arguments and modifies the
+ Configuration of the Tool. The
+ application-specific options are passed along without being modified.
+
+
+ @see Tool
+ @see GenericOptionsParser]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/x86_64/share/hadoop/common/jdiff/hadoop_0.19.2.xml b/x86_64/share/hadoop/common/jdiff/hadoop_0.19.2.xml
new file mode 100644
index 0000000..bbce108
--- /dev/null
+++ b/x86_64/share/hadoop/common/jdiff/hadoop_0.19.2.xml
@@ -0,0 +1,44204 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ final.
+
+ @param name resource to be added, the classpath is examined for a file
+ with that name.]]>
+
+
+
+
+
+ final.
+
+ @param url url of the resource to be added, the local filesystem is
+ examined directly to find the resource, without referring to
+ the classpath.]]>
+
+
+
+
+
+ final.
+
+ @param file file-path of resource to be added, the local filesystem is
+ examined directly to find the resource, without referring to
+ the classpath.]]>
+
+
+
+
+
+ final.
+
+ @param in InputStream to deserialize the object from.]]>
+
+
+
+
+
+
+
+
+
+
+ name property, null if
+ no such property exists.
+
+ Values are processed for variable expansion
+ before being returned.
+
+ @param name the property name.
+ @return the value of the name property,
+ or null if no such property exists.]]>
+
+
+
+
+
+ name property, without doing
+ variable expansion.
+
+ @param name the property name.
+ @return the value of the name property,
+ or null if no such property exists.]]>
+
+
+
+
+
+
+ value of the name property.
+
+ @param name property name.
+ @param value property value.]]>
+
+
+
+
+
+
+ name property. If no such property
+ exists, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value, or defaultValue if the property
+ doesn't exist.]]>
+
+
+
+
+
+
+ name property as an int.
+
+ If no such property exists, or if the specified value is not a valid
+ int, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as an int,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to an int.
+
+ @param name property name.
+ @param value int value of the property.]]>
+
+
+
+
+
+
+ name property as a long.
+ If no such property is specified, or if the specified value is not a valid
+ long, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as a long,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to a long.
+
+ @param name property name.
+ @param value long value of the property.]]>
+
+
+
+
+
+
+ name property as a float.
+ If no such property is specified, or if the specified value is not a valid
+ float, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as a float,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property as a boolean.
+ If no such property is specified, or if the specified value is not a valid
+ boolean, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as a boolean,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to a boolean.
+
+ @param name property name.
+ @param value boolean value of the property.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ name property as
+ a collection of Strings.
+ If no such property is specified then empty collection is returned.
+
+ This is an optimized version of {@link #getStrings(String)}
+
+ @param name property name.
+ @return property value as a collection of Strings.]]>
+
+
+
+
+
+ name property as
+ an array of Strings.
+ If no such property is specified then null is returned.
+
+ @param name property name.
+ @return property value as an array of Strings,
+ or null.]]>
+
+
+
+
+
+
+ name property as
+ an array of Strings.
+ If no such property is specified then default value is returned.
+
+ @param name property name.
+ @param defaultValue The default value
+ @return property value as an array of Strings,
+ or default value.]]>
+
+
+
+
+
+
+ name property as
+ as comma delimited values.
+
+ @param name property name.
+ @param values The values]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ name property
+ as an array of Class.
+ The value of the property specifies a list of comma separated class names.
+ If no such property is specified, then defaultValue is
+ returned.
+
+ @param name the property name.
+ @param defaultValue default value.
+ @return property value as a Class[],
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property as a Class.
+ If no such property is specified, then defaultValue is
+ returned.
+
+ @param name the class name.
+ @param defaultValue default value.
+ @return property value as a Class,
+ or defaultValue.]]>
+
+
+
+
+
+
+
+ name property as a Class
+ implementing the interface specified by xface.
+
+ If no such property is specified, then defaultValue is
+ returned.
+
+ An exception is thrown if the returned class does not implement the named
+ interface.
+
+ @param name the class name.
+ @param defaultValue default value.
+ @param xface the interface implemented by the named class.
+ @return property value as a Class,
+ or defaultValue.]]>
+
+
+
+
+
+
+
+ name property to the name of a
+ theClass implementing the given interface xface.
+
+ An exception is thrown if theClass does not implement the
+ interface xface.
+
+ @param name property name.
+ @param theClass property value.
+ @param xface the interface implemented by the named class.]]>
+
+
+
+
+
+
+
+ dirsProp with
+ the given path. If dirsProp contains multiple directories,
+ then one is chosen based on path's hash code. If the selected
+ directory does not exist, an attempt is made to create it.
+
+ @param dirsProp directory in which to locate the file.
+ @param path file-path.
+ @return local file under the directory with the given path.]]>
+
+
+
+
+
+
+
+ dirsProp with
+ the given path. If dirsProp contains multiple directories,
+ then one is chosen based on path's hash code. If the selected
+ directory does not exist, an attempt is made to create it.
+
+ @param dirsProp directory in which to locate the file.
+ @param path file-path.
+ @return local file under the directory with the given path.]]>
+
+
+
+
+
+
+
+
+
+
+
+ name.
+
+ @param name configuration resource name.
+ @return an input stream attached to the resource.]]>
+
+
+
+
+
+ name.
+
+ @param name configuration resource name.
+ @return a reader attached to the resource.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ String
+ key-value pairs in the configuration.
+
+ @return an iterator over the entries.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true to set quiet-mode on, false
+ to turn it off.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Resources
+
+
Configurations are specified by resources. A resource contains a set of
+ name/value pairs as XML data. Each resource is named by either a
+ String or by a {@link Path}. If named by a String,
+ then the classpath is examined for a file with that name. If named by a
+ Path, then the local filesystem is examined directly, without
+ referring to the classpath.
+
+
Unless explicitly turned off, Hadoop by default specifies two
+ resources, loaded in-order from the classpath:
hadoop-site.xml: Site-specific configuration for a given hadoop
+ installation.
+
+ Applications may add additional resources, which are loaded
+ subsequent to these resources in the order they are added.
+
+
Final Parameters
+
+
Configuration parameters may be declared final.
+ Once a resource declares a value final, no subsequently-loaded
+ resource can alter that value.
+ For example, one might define a final parameter with:
+
+
+ When conf.get("tempdir") is called, then ${basedir}
+ will be resolved to another property in this Configuration, while
+ ${user.name} would then ordinarily be resolved to the value
+ of the System property with that name.]]>
+
Applications specify the files, via urls (hdfs:// or http://) to be cached
+ via the {@link org.apache.hadoop.mapred.JobConf}.
+ The DistributedCache assumes that the
+ files specified via hdfs:// urls are already present on the
+ {@link FileSystem} at the path specified by the url.
+
+
The framework will copy the necessary files on to the slave node before
+ any tasks for the job are executed on that node. Its efficiency stems from
+ the fact that the files are only copied once per job and the ability to
+ cache archives which are un-archived on the slaves.
+
+
DistributedCache can be used to distribute simple, read-only
+ data/text files and/or more complex types such as archives, jars etc.
+ Archives (zip, tar and tgz/tar.gz files) are un-archived at the slave nodes.
+ Jars may be optionally added to the classpath of the tasks, a rudimentary
+ software distribution mechanism. Files have execution permissions.
+ Optionally users can also direct it to symlink the distributed cache file(s)
+ into the working directory of the task.
+
+
DistributedCache tracks modification timestamps of the cache
+ files. Clearly the cache files should not be modified by the application
+ or externally while the job is executing.
+
+
Here is an illustrative example on how to use the
+ DistributedCache:
+
+ // Setting up the cache for the application
+
+ 1. Copy the requisite files to the FileSystem:
+
+ $ bin/hadoop fs -copyFromLocal lookup.dat /myapp/lookup.dat
+ $ bin/hadoop fs -copyFromLocal map.zip /myapp/map.zip
+ $ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar
+ $ bin/hadoop fs -copyFromLocal mytar.tar /myapp/mytar.tar
+ $ bin/hadoop fs -copyFromLocal mytgz.tgz /myapp/mytgz.tgz
+ $ bin/hadoop fs -copyFromLocal mytargz.tar.gz /myapp/mytargz.tar.gz
+
+ 2. Setup the application's JobConf:
+
+ JobConf job = new JobConf();
+ DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"),
+ job);
+ DistributedCache.addCacheArchive(new URI("/myapp/map.zip", job);
+ DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
+ DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar", job);
+ DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz", job);
+ DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz", job);
+
+ 3. Use the cached files in the {@link org.apache.hadoop.mapred.Mapper}
+ or {@link org.apache.hadoop.mapred.Reducer}:
+
+ public static class MapClass extends MapReduceBase
+ implements Mapper<K, V, K, V> {
+
+ private Path[] localArchives;
+ private Path[] localFiles;
+
+ public void configure(JobConf job) {
+ // Get the cached archives/files
+ localArchives = DistributedCache.getLocalCacheArchives(job);
+ localFiles = DistributedCache.getLocalCacheFiles(job);
+ }
+
+ public void map(K key, V value,
+ OutputCollector<K, V> output, Reporter reporter)
+ throws IOException {
+ // Use data from the cached archives/files here
+ // ...
+ // ...
+ output.collect(k, v);
+ }
+ }
+
+
+ A filename pattern is composed of regular characters and
+ special pattern matching characters, which are:
+
+
+
+
+
+
?
+
Matches any single character.
+
+
+
*
+
Matches zero or more characters.
+
+
+
[abc]
+
Matches a single character from character set
+ {a,b,c}.
+
+
+
[a-b]
+
Matches a single character from the character range
+ {a...b}. Note that character a must be
+ lexicographically less than or equal to character b.
+
+
+
[^a]
+
Matches a single character that is not from character set or range
+ {a}. Note that the ^ character must occur
+ immediately to the right of the opening bracket.
+
+
+
\c
+
Removes (escapes) any special meaning of character c.
+
+
+
{ab,cd}
+
Matches a string from the string set {ab, cd}
+
+
+
{ab,c{de,fh}}
+
Matches a string from the string set {ab, cde, cfh}
+
+
+
+
+
+ @param pathPattern a regular expression specifying a pth pattern
+
+ @return an array of paths that match the path pattern
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ All user code that may potentially use the Hadoop Distributed
+ File System should be written to use a FileSystem object. The
+ Hadoop DFS is a multi-machine system that appears as a single
+ disk. It's useful because of its fault tolerance and potentially
+ very large capacity.
+
+
+ The local implementation is {@link LocalFileSystem} and distributed
+ implementation is DistributedFileSystem.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FilterFileSystem contains
+ some other file system, which it uses as
+ its basic file system, possibly transforming
+ the data along the way or providing additional
+ functionality. The class FilterFileSystem
+ itself simply overrides all methods of
+ FileSystem with versions that
+ pass all requests to the contained file
+ system. Subclasses of FilterFileSystem
+ may further override some of these methods
+ and may also provide additional methods
+ and fields.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ buf at offset
+ and checksum into checksum.
+ The method is used for implementing read, therefore, it should be optimized
+ for sequential reading
+ @param pos chunkPos
+ @param buf desitination buffer
+ @param offset offset in buf at which to store data
+ @param len maximun number of bytes to read
+ @return number of bytes read]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ -1 if the end of the
+ stream is reached.
+ @exception IOException if an I/O error occurs.]]>
+
+
+
+
+
+
+
+
+ This method implements the general contract of the corresponding
+ {@link InputStream#read(byte[], int, int) read} method of
+ the {@link InputStream} class. As an additional
+ convenience, it attempts to read as many bytes as possible by repeatedly
+ invoking the read method of the underlying stream. This
+ iterated read continues until one of the following
+ conditions becomes true:
+
+
The specified number of bytes have been read,
+
+
The read method of the underlying stream returns
+ -1, indicating end-of-file.
+
+
If the first read on the underlying stream returns
+ -1 to indicate end-of-file then this method returns
+ -1. Otherwise this method returns the number of bytes
+ actually read.
+
+ @param b destination buffer.
+ @param off offset at which to start storing bytes.
+ @param len maximum number of bytes to read.
+ @return the number of bytes read, or -1 if the end of
+ the stream has been reached.
+ @exception IOException if an I/O error occurs.
+ ChecksumException if any checksum error occurs]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ n bytes of data from the
+ input stream.
+
+
This method may skip more bytes than are remaining in the backing
+ file. This produces no exception and the number of bytes skipped
+ may include some number of bytes that were beyond the EOF of the
+ backing file. Attempting to read from the stream after skipping past
+ the end will result in -1 indicating the end of the file.
+
+
If n is negative, no bytes are skipped.
+
+ @param n the number of bytes to be skipped.
+ @return the actual number of bytes skipped.
+ @exception IOException if an I/O error occurs.
+ ChecksumException if the chunk to skip to is corrupted]]>
+
+
+
+
+
+
+ This method may seek past the end of the file.
+ This produces no exception and an attempt to read from
+ the stream will result in -1 indicating the end of the file.
+
+ @param pos the postion to seek to.
+ @exception IOException if an I/O error occurs.
+ ChecksumException if the chunk to seek to is corrupted]]>
+
+
+
+
+
+
+
+
+
+ len bytes from
+ stm
+
+ @param stm an input stream
+ @param buf destiniation buffer
+ @param offset offset at which to store data
+ @param len number of bytes to read
+ @return actual number of bytes read
+ @throws IOException if there is any IO error]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ len bytes from the specified byte array
+ starting at offset off and generate a checksum for
+ each data chunk.
+
+
This method stores bytes from the given array into this
+ stream's buffer before it gets checksumed. The buffer gets checksumed
+ and flushed to the underlying output stream when all data
+ in a checksum chunk are in the buffer. If the buffer is empty and
+ requested length is at least as large as the size of next checksum chunk
+ size, this method will checksum and write the chunk directly
+ to the underlying output stream. Thus it avoids uneccessary data copy.
+
+ @param b the data.
+ @param off the start offset in the data.
+ @param len the number of bytes to write.
+ @exception IOException if an I/O error occurs.]]>
+
+
+ DataInputBuffer buffer = new DataInputBuffer();
+ while (... loop condition ...) {
+ byte[] data = ... get data ...;
+ int dataLength = ... get data length ...;
+ buffer.reset(data, dataLength);
+ ... read buffer using DataInput methods ...
+ }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This saves memory over creating a new DataOutputStream and
+ ByteArrayOutputStream each time data is written.
+
+
Typical usage is something like the following:
+
+ DataOutputBuffer buffer = new DataOutputBuffer();
+ while (... loop condition ...) {
+ buffer.reset();
+ ... write buffer using DataOutput methods ...
+ byte[] data = buffer.getData();
+ int dataLength = buffer.getLength();
+ ... write data to its ultimate destination ...
+ }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to store
+ @param item the object to be stored
+ @param keyName the name of the key to use
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param keyName the name of the key to use
+ @param itemClass the class of the item
+ @return restored object
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param items the objects to be stored
+ @param keyName the name of the key to use
+ @throws IndexOutOfBoundsException if the items array is empty
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param keyName the name of the key to use
+ @param itemClass the class of the item
+ @return restored object
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+ DefaultStringifier offers convenience methods to store/load objects to/from
+ the configuration.
+
+ @param the class of the objects to stringify]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a DoubleWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a FloatWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ When two sequence files, which have same Key type but different Value
+ types, are mapped out to reduce, multiple Value types is not allowed.
+ In this case, this class can help you wrap instances with different types.
+
+
+
+ Compared with ObjectWritable, this class is much more effective,
+ because ObjectWritable will append the class declaration as a String
+ into the output file in every Key-Value pair.
+
+
+
+ Generic Writable implements {@link Configurable} interface, so that it will be
+ configured by the framework. The configuration is passed to the wrapped objects
+ implementing {@link Configurable} interface before deserialization.
+
+
+ how to use it:
+ 1. Write your own class, such as GenericObject, which extends GenericWritable.
+ 2. Implements the abstract method getTypes(), defines
+ the classes which will be wrapped in GenericObject in application.
+ Attention: this classes defined in getTypes() method, must
+ implement Writable interface.
+
+
+ @since Nov 8, 2006]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This saves memory over creating a new InputStream and
+ ByteArrayInputStream each time data is read.
+
+
Typical usage is something like the following:
+
+ InputBuffer buffer = new InputBuffer();
+ while (... loop condition ...) {
+ byte[] data = ... get data ...;
+ int dataLength = ... get data length ...;
+ buffer.reset(data, dataLength);
+ ... read buffer using InputStream methods ...
+ }
+
+ @see DataInputBuffer
+ @see DataOutput]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a IntWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ closes the input and output streams
+ at the end.
+ @param in InputStrem to read from
+ @param out OutputStream to write to
+ @param conf the Configuration object]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ignore any {@link IOException} or
+ null pointers. Must only be used for cleanup in exception handlers.
+ @param log the log to record problems to at debug level. Can be null.
+ @param closeables the objects to close]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a LongWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A map is a directory containing two files, the data file,
+ containing all keys and values in the map, and a smaller index
+ file, containing a fraction of the keys. The fraction is determined by
+ {@link Writer#getIndexInterval()}.
+
+
The index file is read entirely into memory. Thus key implementations
+ should try to keep themselves small.
+
+
Map files are created by adding entries in-order. To maintain a large
+ database, perform updates by copying the previous version of a database and
+ merging in a sorted change list, to create a new version of the database in
+ a new file. Sorting large change lists can be done with {@link
+ SequenceFile.Sorter}.]]>
+
SequenceFile provides {@link Writer}, {@link Reader} and
+ {@link Sorter} classes for writing, reading and sorting respectively.
+
+ There are three SequenceFileWriters based on the
+ {@link CompressionType} used to compress key/value pairs:
+
+
+ Writer : Uncompressed records.
+
+
+ RecordCompressWriter : Record-compressed files, only compress
+ values.
+
+
+ BlockCompressWriter : Block-compressed files, both keys &
+ values are collected in 'blocks'
+ separately and compressed. The size of
+ the 'block' is configurable.
+
+
+
The actual compression algorithm used to compress key and/or values can be
+ specified by using the appropriate {@link CompressionCodec}.
+
+
The recommended way is to use the static createWriter methods
+ provided by the SequenceFile to chose the preferred format.
+
+
The {@link Reader} acts as the bridge and can read any of the above
+ SequenceFile formats.
+
+
SequenceFile Formats
+
+
Essentially there are 3 different formats for SequenceFiles
+ depending on the CompressionType specified. All of them share a
+ common header described below.
+
+
SequenceFile Header
+
+
+ version - 3 bytes of magic header SEQ, followed by 1 byte of actual
+ version number (e.g. SEQ4 or SEQ6)
+
+
+ keyClassName -key class
+
+
+ valueClassName - value class
+
+
+ compression - A boolean which specifies if compression is turned on for
+ keys/values in this file.
+
+
+ blockCompression - A boolean which specifies if block-compression is
+ turned on for keys/values in this file.
+
+
+ compression codec - CompressionCodec class which is used for
+ compression of keys and/or values (if compression is
+ enabled).
+
+
+ metadata - {@link Metadata} for this file.
+
+
+ sync - A sync marker to denote end of the header.
+
The compressed blocks of key lengths and value lengths consist of the
+ actual lengths of individual keys/values encoded in ZeroCompressedInteger
+ format.
+
+ @see CompressionCodec]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ key, skipping its
+ value. True if another entry exists, and false at end of file.]]>
+
+
+
+
+
+
+
+ key and
+ val. Returns true if such a pair exists and false when at
+ end of file]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The position passed must be a position returned by {@link
+ SequenceFile.Writer#getLength()} when writing this file. To seek to an arbitrary
+ position, use {@link SequenceFile.Reader#sync(long)}.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ SegmentDescriptor
+ @param segments the list of SegmentDescriptors
+ @param tmpDir the directory to write temporary files into
+ @return RawKeyValueIterator
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ For best performance, applications should make sure that the {@link
+ Writable#readFields(DataInput)} implementation of their keys is
+ very efficient. In particular, it should avoid allocating memory.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This always returns a synchronized position. In other words,
+ immediately after calling {@link SequenceFile.Reader#seek(long)} with a position
+ returned by this method, {@link SequenceFile.Reader#next(Writable)} may be called. However
+ the key may be earlier in the file than key last written when this
+ method was called (e.g., with block-compression, it may be the first key
+ in the block that was being written when this method was called).]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ key. Returns
+ true if such a key exists and false when at the end of the set.]]>
+
+
+
+
+
+
+ key.
+ Returns key, or null if no match exists.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the class of the objects to stringify]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ position. Note that this
+ method avoids using the converter or doing String instatiation
+ @return the Unicode scalar value at position or -1
+ if the position is invalid or points to a
+ trailing byte]]>
+
+
+
+
+
+
+
+
+
+ what in the backing
+ buffer, starting as position start. The starting
+ position is measured in bytes and the return value is in
+ terms of byte position in the buffer. The backing buffer is
+ not converted to a string for this operation.
+ @return byte position of the first occurence of the search
+ string in the UTF-8 buffer or -1 if not found]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a Text with the same contents.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ replace is true, then
+ malformed input is replaced with the
+ substitution character, which is U+FFFD. Otherwise the
+ method throws a MalformedInputException.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ replace is true, then
+ malformed input is replaced with the
+ substitution character, which is U+FFFD. Otherwise the
+ method throws a MalformedInputException.
+ @return ByteBuffer: bytes stores at ByteBuffer.array()
+ and length is ByteBuffer.limit()]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ In
+ addition, it provides methods for string traversal without converting the
+ byte array to a string.
Also includes utilities for
+ serializing/deserialing a string, coding/decoding a string, checking if a
+ byte array contains valid UTF8 code, calculating the length of an encoded
+ string.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a UTF8 with the same contents.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Also includes utilities for efficiently reading and writing UTF-8.
+
+ @deprecated replaced by Text]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This is useful when a class may evolve, so that instances written by the
+ old version of the class may still be processed by the new version. To
+ handle this situation, {@link #readFields(DataInput)}
+ implementations should catch {@link VersionMismatchException}.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a VIntWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a VLongWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ out.
+
+ @param out DataOuput to serialize this object into.
+ @throws IOException]]>
+
+
+
+
+
+
+ in.
+
+
For efficiency, implementations should attempt to re-use storage in the
+ existing object where possible.
+
+ @param in DataInput to deseriablize this object from.
+ @throws IOException]]>
+
+
+
+ Any key or value type in the Hadoop Map-Reduce
+ framework implements this interface.
+
+
Implementations typically implement a static read(DataInput)
+ method which constructs a new instance, calls {@link #readFields(DataInput)}
+ and returns the instance.
+
+
Example:
+
+ public class MyWritable implements Writable {
+ // Some data
+ private int counter;
+ private long timestamp;
+
+ public void write(DataOutput out) throws IOException {
+ out.writeInt(counter);
+ out.writeLong(timestamp);
+ }
+
+ public void readFields(DataInput in) throws IOException {
+ counter = in.readInt();
+ timestamp = in.readLong();
+ }
+
+ public static MyWritable read(DataInput in) throws IOException {
+ MyWritable w = new MyWritable();
+ w.readFields(in);
+ return w;
+ }
+ }
+
]]>
+
+
+
+
+
+
+
+
+ WritableComparables can be compared to each other, typically
+ via Comparators. Any type which is to be used as a
+ key in the Hadoop Map-Reduce framework should implement this
+ interface.
+
+
Example:
+
+ public class MyWritableComparable implements WritableComparable {
+ // Some data
+ private int counter;
+ private long timestamp;
+
+ public void write(DataOutput out) throws IOException {
+ out.writeInt(counter);
+ out.writeLong(timestamp);
+ }
+
+ public void readFields(DataInput in) throws IOException {
+ counter = in.readInt();
+ timestamp = in.readLong();
+ }
+
+ public int compareTo(MyWritableComparable w) {
+ int thisValue = this.value;
+ int thatValue = ((IntWritable)o).value;
+ return (thisValue < thatValue ? -1 : (thisValue==thatValue ? 0 : 1));
+ }
+ }
+
One may optimize compare-intensive operations by overriding
+ {@link #compare(byte[],int,int,byte[],int,int)}. Static utility methods are
+ provided to assist in optimized implementations of this method.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Enum type
+ @param in DataInput to read from
+ @param enumType Class type of Enum
+ @return Enum represented by String read from DataInput
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ len number of bytes in input streamin
+ @param in input stream
+ @param len number of bytes to skip
+ @throws IOException when skipped less number of bytes]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ CompressionCodec for which to get the
+ Compressor
+ @return Compressor for the given
+ CompressionCodec from the pool or a new one]]>
+
+
+
+
+
+ CompressionCodec for which to get the
+ Decompressor
+ @return Decompressor for the given
+ CompressionCodec the pool or a new one]]>
+
+
+
+
+
+ Compressor to be returned to the pool]]>
+
+
+
+
+
+ Decompressor to be returned to the
+ pool]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Implementations are assumed to be buffered. This permits clients to
+ reposition the underlying input stream then call {@link #resetState()},
+ without having to also synchronize client buffers.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true indicating that more input data is required.
+
+ @param b Input data
+ @param off Start offset
+ @param len Length]]>
+
+
+
+
+ true if the input data buffer is empty and
+ #setInput() should be called in order to provide more input.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the end of the compressed
+ data output stream has been reached.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true indicating that more input data is required.
+
+ @param b Input data
+ @param off Start offset
+ @param len Length]]>
+
+
+
+
+ true if the input data buffer is empty and
+ #setInput() should be called in order to provide more input.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ true if a preset dictionary is needed for decompression.
+ @return true if a preset dictionary is needed for decompression]]>
+
+
+
+
+ true if the end of the compressed
+ data output stream has been reached.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if native-lzo library is loaded & initialized;
+ else false]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ lzo compression/decompression pair.
+ http://www.oberhumer.com/opensource/lzo/]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ lzo compression/decompression pair compatible with lzop.
+ http://www.lzop.org/]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FIXME: This array should be in a private or package private location,
+ since it could be modified by malicious code.
+ ]]>
+
+
+
+
+ This interface is public for historical purposes. You should have no need to
+ use it.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Although BZip2 headers are marked with the magic "Bz" this
+ constructor expects the next byte in the stream to be the first one after
+ the magic. Thus callers have to skip the first two bytes. Otherwise this
+ constructor will throw an exception.
+
+
+ @throws IOException
+ if the stream content is malformed or an I/O error occurs.
+ @throws NullPointerException
+ if in == null]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The decompression requires large amounts of memory. Thus you should call the
+ {@link #close() close()} method as soon as possible, to force
+ CBZip2InputStream to release the allocated memory. See
+ {@link CBZip2OutputStream CBZip2OutputStream} for information about memory
+ usage.
+
+
+
+ CBZip2InputStream reads bytes from the compressed source stream via
+ the single byte {@link java.io.InputStream#read() read()} method exclusively.
+ Thus you should consider to use a buffered source stream.
+
+
+
+ Instances of this class are not threadsafe.
+
]]>
+
+
+
+
+
+
+
+
+
+ CBZip2OutputStream with a blocksize of 900k.
+
+
+ Attention: The caller is resonsible to write the two BZip2 magic
+ bytes "BZ" to the specified stream prior to calling this
+ constructor.
+
+
+ @param out *
+ the destination stream.
+
+ @throws IOException
+ if an I/O error occurs in the specified stream.
+ @throws NullPointerException
+ if out == null.]]>
+
+
+
+
+
+ CBZip2OutputStream with specified blocksize.
+
+
+ Attention: The caller is resonsible to write the two BZip2 magic
+ bytes "BZ" to the specified stream prior to calling this
+ constructor.
+
+
+
+ @param out
+ the destination stream.
+ @param blockSize
+ the blockSize as 100k units.
+
+ @throws IOException
+ if an I/O error occurs in the specified stream.
+ @throws IllegalArgumentException
+ if (blockSize < 1) || (blockSize > 9).
+ @throws NullPointerException
+ if out == null.
+
+ @see #MIN_BLOCKSIZE
+ @see #MAX_BLOCKSIZE]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ inputLength this method returns MAX_BLOCKSIZE
+ always.
+
+ @param inputLength
+ The length of the data which will be compressed by
+ CBZip2OutputStream.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ == 1.]]>
+
+
+
+
+ == 9.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ If you are ever unlucky/improbable enough to get a stack overflow whilst
+ sorting, increase the following constant and try again. In practice I
+ have never seen the stack go above 27 elems, so the following limit seems
+ very generous.
+ ]]>
+
+
+
+
+ The compression requires large amounts of memory. Thus you should call the
+ {@link #close() close()} method as soon as possible, to force
+ CBZip2OutputStream to release the allocated memory.
+
+
+
+ You can shrink the amount of allocated memory and maybe raise the compression
+ speed by choosing a lower blocksize, which in turn may cause a lower
+ compression ratio. You can avoid unnecessary memory allocation by avoiding
+ using a blocksize which is bigger than the size of the input.
+
+
+
+ You can compute the memory usage for compressing by the following formula:
+
+
+
+ <code>400k + (9 * blocksize)</code>.
+
+
+
+ To get the memory required for decompression by {@link CBZip2InputStream
+ CBZip2InputStream} use
+
+
+
+ <code>65k + (5 * blocksize)</code>.
+
+
+
+
+
+
+
Memory usage by blocksize
+
+
+
Blocksize
Compression
+ memory usage
Decompression
+ memory usage
+
+
+
100k
+
1300k
+
565k
+
+
+
200k
+
2200k
+
1065k
+
+
+
300k
+
3100k
+
1565k
+
+
+
400k
+
4000k
+
2065k
+
+
+
500k
+
4900k
+
2565k
+
+
+
600k
+
5800k
+
3065k
+
+
+
700k
+
6700k
+
3565k
+
+
+
800k
+
7600k
+
4065k
+
+
+
900k
+
8500k
+
4565k
+
+
+
+
+ For decompression CBZip2InputStream allocates less memory if the
+ bzipped input is smaller than one block.
+
+
+
+ Instances of this class are not threadsafe.
+
+
+
+ TODO: Update to BZip2 1.0.1
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if lzo compressors are loaded & initialized,
+ else false]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if lzo decompressors are loaded & initialized,
+ else false]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ @return the total (non-negative) number of uncompressed bytes input so far]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ @return the total (non-negative) number of uncompressed bytes input so far]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if native-zlib is loaded & initialized
+ and can be loaded for this job, else false]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Keep trying a limited number of times, waiting a fixed time between attempts,
+ and then fail by re-throwing the exception.
+ ]]>
+
+
+
+
+
+
+
+
+ Keep trying for a maximum time, waiting a fixed time between attempts,
+ and then fail by re-throwing the exception.
+ ]]>
+
+
+
+
+
+
+
+
+ Keep trying a limited number of times, waiting a growing amount of time between attempts,
+ and then fail by re-throwing the exception.
+ The time between attempts is sleepTime mutliplied by the number of tries so far.
+ ]]>
+
+
+
+
+
+
+
+
+ Keep trying a limited number of times, waiting a growing amount of time between attempts,
+ and then fail by re-throwing the exception.
+ The time between attempts is sleepTime mutliplied by a random
+ number in the range of [0, 2 to the number of retries)
+ ]]>
+
+
+
+
+
+
+
+ Set a default policy with some explicit handlers for specific exceptions.
+ ]]>
+
+
+
+
+
+
+
+ A retry policy for RemoteException
+ Set a default policy with some explicit handlers for specific exceptions.
+ ]]>
+
+
+
+
+
+ Try once, and fail by re-throwing the exception.
+ This corresponds to having no retry mechanism in place.
+ ]]>
+
+
+
+
+
+ Try once, and fail silently for void methods, or by
+ re-throwing the exception for non-void methods.
+ ]]>
+
+
+
+
+
+ Keep trying forever.
+ ]]>
+
+
+
+
+ A collection of useful implementations of {@link RetryPolicy}.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+ Determines whether the framework should retry a
+ method for the given exception, and the number
+ of retries that have been made for that operation
+ so far.
+
+ @param e The exception that caused the method to fail.
+ @param retries The number of times the method has been retried.
+ @return true if the method should be retried,
+ false if the method should not be retried
+ but shouldn't fail with an exception (only for void methods).
+ @throws Exception The re-thrown exception e indicating
+ that the method failed and should not be retried further.]]>
+
+
+
+
+ Specifies a policy for retrying method failures.
+ Implementations of this interface should be immutable.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Create a proxy for an interface of an implementation class
+ using the same retry policy for each method in the interface.
+
+ @param iface the interface that the retry will implement
+ @param implementation the instance whose methods should be retried
+ @param retryPolicy the policy for retirying method call failures
+ @return the retry proxy]]>
+
+
+
+
+
+
+
+
+ Create a proxy for an interface of an implementation class
+ using the a set of retry policies specified by method name.
+ If no retry policy is defined for a method then a default of
+ {@link RetryPolicies#TRY_ONCE_THEN_FAIL} is used.
+
+ @param iface the interface that the retry will implement
+ @param implementation the instance whose methods should be retried
+ @param methodNameToPolicyMap a map of method names to retry policies
+ @return the retry proxy]]>
+
+
+
+
+ A factory for creating retry proxies.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+ Prepare the deserializer for reading.]]>
+
+
+
+
+
+
+
+ Deserialize the next object from the underlying input stream.
+ If the object t is non-null then this deserializer
+ may set its internal state to the next object read from the input
+ stream. Otherwise, if the object t is null a new
+ deserialized object will be created.
+
+ @return the deserialized object]]>
+
+
+
+
+
+ Close the underlying input stream and clear up any resources.]]>
+
+
+
+
+ Provides a facility for deserializing objects of type from an
+ {@link InputStream}.
+
+
+
+ Deserializers are stateful, but must not buffer the input since
+ other producers may read from the input between calls to
+ {@link #deserialize(Object)}.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A {@link RawComparator} that uses a {@link Deserializer} to deserialize
+ the objects to be compared so that the standard {@link Comparator} can
+ be used to compare them.
+
+
+ One may optimize compare-intensive operations by using a custom
+ implementation of {@link RawComparator} that operates directly
+ on byte representations.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ An experimental {@link Serialization} for Java {@link Serializable} classes.
+
+ @see JavaSerializationComparator]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A {@link RawComparator} that uses a {@link JavaSerialization}
+ {@link Deserializer} to deserialize objects that are then compared via
+ their {@link Comparable} interfaces.
+
+ @param
+ @see JavaSerialization]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Encapsulates a {@link Serializer}/{@link Deserializer} pair.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+ Serializations are found by reading the io.serializations
+ property from conf, which is a comma-delimited list of
+ classnames.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A factory for {@link Serialization}s.
+ ]]>
+
+
+
+
+
+
+
+
+
+ Prepare the serializer for writing.]]>
+
+
+
+
+
+
+ Serialize t to the underlying output stream.]]>
+
+
+
+
+
+ Close the underlying output stream and clear up any resources.]]>
+
+
+
+
+ Provides a facility for serializing objects of type to an
+ {@link OutputStream}.
+
+
+
+ Serializers are stateful, but must not buffer the output since
+ other producers may write to the output between calls to
+ {@link #serialize(Object)}.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ param, to the IPC server running at
+ address, returning the value. Throws exceptions if there are
+ network problems or if the remote code threw an exception.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Unwraps any IOException.
+
+ @param lookupTypes the desired exception class.
+ @return IOException, which is either the lookupClass exception or this.]]>
+
+
+
+
+ This unwraps any Throwable that has a constructor taking
+ a String as a parameter.
+ Otherwise it returns this.
+
+ @return Throwable]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ protocol is a Java interface. All parameters and return types must
+ be one of:
+
+
a primitive type, boolean, byte,
+ char, short, int, long,
+ float, double, or void; or
+
+
a {@link String}; or
+
+
a {@link Writable}; or
+
+
an array of the above types
+
+ All methods in the protocol should throw only IOException. No field data of
+ the protocol instance is transmitted.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ handlerCount determines
+ the number of handler threads that will be used to process calls.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class has a number of metrics variables that are publicly accessible;
+ these variables (objects) have methods to update their values;
+ for example:
+
{@link #rpcQueueTime}.inc(time)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ For the statistics that are sampled and averaged, one must specify
+ a metrics context that does periodic update calls. Most do.
+ The default Null metrics context however does NOT. So if you aren't
+ using any other metrics context then you can turn on the viewing and averaging
+ of sampled metrics by specifying the following two lines
+ in the hadoop-meterics.properties file:
+
+ Note that the metrics are collected regardless of the context used.
+ The context with the update thread is used to average the data periodically]]>
+
Grouphandles localization of the class name and the
+ counter names.
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FileInputFormat implementations can override this and return
+ false to ensure that individual input files are never split-up
+ so that {@link Mapper}s process entire files.
+
+ @param fs the file system that the file is on
+ @param filename the file name to check
+ @return is this file splitable?]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FileInputFormat is the base class for all file-based
+ InputFormats. This provides a generic implementation of
+ {@link #getSplits(JobConf, int)}.
+ Subclasses of FileInputFormat can also override the
+ {@link #isSplitable(FileSystem, Path)} method to ensure input-files are
+ not split-up and are processed as a whole by {@link Mapper}s.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the job output should be compressed,
+ false otherwise]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Tasks' Side-Effect Files
+
+
Note: The following is valid only if the {@link OutputCommitter}
+ is {@link FileOutputCommitter}. If OutputCommitter is not
+ a FileOutputCommitter, the task's temporary output
+ directory is same as {@link #getOutputPath(JobConf)} i.e.
+ ${mapred.output.dir}$
+
+
Some applications need to create/write-to side-files, which differ from
+ the actual job-outputs.
+
+
In such cases there could be issues with 2 instances of the same TIP
+ (running simultaneously e.g. speculative tasks) trying to open/write-to the
+ same file (path) on HDFS. Hence the application-writer will have to pick
+ unique names per task-attempt (e.g. using the attemptid, say
+ attempt_200709221812_0001_m_000000_0), not just per TIP.
+
+
To get around this the Map-Reduce framework helps the application-writer
+ out by maintaining a special
+ ${mapred.output.dir}/_temporary/_${taskid}
+ sub-directory for each task-attempt on HDFS where the output of the
+ task-attempt goes. On successful completion of the task-attempt the files
+ in the ${mapred.output.dir}/_temporary/_${taskid} (only)
+ are promoted to ${mapred.output.dir}. Of course, the
+ framework discards the sub-directory of unsuccessful task-attempts. This
+ is completely transparent to the application.
+
+
The application-writer can take advantage of this by creating any
+ side-files required in ${mapred.work.output.dir} during execution
+ of his reduce-task i.e. via {@link #getWorkOutputPath(JobConf)}, and the
+ framework will move them out similarly - thus she doesn't have to pick
+ unique paths per task-attempt.
+
+
Note: the value of ${mapred.work.output.dir} during
+ execution of a particular task-attempt is actually
+ ${mapred.output.dir}/_temporary/_{$taskid}, and this value is
+ set by the map-reduce framework. So, just create any side-files in the
+ path returned by {@link #getWorkOutputPath(JobConf)} from map/reduce
+ task to take advantage of this feature.
+
+
The entire discussion holds true for maps of jobs with
+ reducer=NONE (i.e. 0 reduces) since output of the map, in that case,
+ goes directly to HDFS.
+
+ @return the {@link Path} to the task's temporary output directory
+ for the map-reduce job.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The generated name can be used to create custom files from within the
+ different tasks for the job, the names for different tasks will not collide
+ with each other.
+
+
The given name is postfixed with the task type, 'm' for maps, 'r' for
+ reduces and the task partition number. For example, give a name 'test'
+ running on the first map o the job the generated name will be
+ 'test-m-00000'.
+
+ @param conf the configuration for the job.
+ @param name the name to make unique.
+ @return a unique name accross all tasks of the job.]]>
+
+
+
+
+
+
+ The path can be used to create custom files from within the map and
+ reduce tasks. The path name will be unique for each task. The path parent
+ will be the job output directory.ls
+
+
This method uses the {@link #getUniqueName} method to make the file name
+ unique for the task.
+
+ @param conf the configuration for the job.
+ @param name the name for the file.
+ @return a unique path accross all tasks of the job.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Each {@link InputSplit} is then assigned to an individual {@link Mapper}
+ for processing.
+
+
Note: The split is a logical split of the inputs and the
+ input files are not physically split into chunks. For e.g. a split could
+ be <input-file-path, start, offset> tuple.
+
+ @param job job configuration.
+ @param numSplits the desired number of splits, a hint.
+ @return an array of {@link InputSplit}s for the job.]]>
+
+
+
+
+
+
+
+
+ It is the responsibility of the RecordReader to respect
+ record boundaries while processing the logical split to present a
+ record-oriented view to the individual task.
+
+ @param split the {@link InputSplit}
+ @param job the job that this split belongs to
+ @return a {@link RecordReader}]]>
+
+
+
+ InputFormat describes the input-specification for a
+ Map-Reduce job.
+
+
The Map-Reduce framework relies on the InputFormat of the
+ job to:
+
+
+ Validate the input-specification of the job.
+
+ Split-up the input file(s) into logical {@link InputSplit}s, each of
+ which is then assigned to an individual {@link Mapper}.
+
+
+ Provide the {@link RecordReader} implementation to be used to glean
+ input records from the logical InputSplit for processing by
+ the {@link Mapper}.
+
+
+
+
The default behavior of file-based {@link InputFormat}s, typically
+ sub-classes of {@link FileInputFormat}, is to split the
+ input into logical {@link InputSplit}s based on the total size, in
+ bytes, of the input files. However, the {@link FileSystem} blocksize of
+ the input files is treated as an upper bound for input splits. A lower bound
+ on the split size can be set via
+
+ mapred.min.split.size.
+
+
Clearly, logical splits based on input-size is insufficient for many
+ applications since record boundaries are to respected. In such cases, the
+ application has to also implement a {@link RecordReader} on whom lies the
+ responsibilty to respect record-boundaries and present a record-oriented
+ view of the logical InputSplit to the individual task.
+
+ @see InputSplit
+ @see RecordReader
+ @see JobClient
+ @see FileInputFormat]]>
+
+
+
+
+
+
+
+
+
+ InputSplit.
+
+ @return the number of bytes in the input split.
+ @throws IOException]]>
+
+
+
+
+
+ InputSplit is
+ located as an array of Strings.
+ @throws IOException]]>
+
+
+
+ InputSplit represents the data to be processed by an
+ individual {@link Mapper}.
+
+
Typically, it presents a byte-oriented view on the input and is the
+ responsibility of {@link RecordReader} of the job to process this and present
+ a record-oriented view.
+
+ @see InputFormat
+ @see RecordReader]]>
+
+ Checking the input and output specifications of the job.
+
+
+ Computing the {@link InputSplit}s for the job.
+
+
+ Setup the requisite accounting information for the {@link DistributedCache}
+ of the job, if necessary.
+
+
+ Copying the job's jar and configuration to the map-reduce system directory
+ on the distributed file-system.
+
+
+ Submitting the job to the JobTracker and optionally monitoring
+ it's status.
+
+
+
+ Normally the user creates the application, describes various facets of the
+ job via {@link JobConf} and then uses the JobClient to submit
+ the job and monitor its progress.
+
+
Here is an example on how to use JobClient:
+
+ // Create a new JobConf
+ JobConf job = new JobConf(new Configuration(), MyJob.class);
+
+ // Specify various job-specific parameters
+ job.setJobName("myjob");
+
+ job.setInputPath(new Path("in"));
+ job.setOutputPath(new Path("out"));
+
+ job.setMapperClass(MyJob.MyMapper.class);
+ job.setReducerClass(MyJob.MyReducer.class);
+
+ // Submit the job, then poll for progress until the job is complete
+ JobClient.runJob(job);
+
+
+
Job Control
+
+
At times clients would chain map-reduce jobs to accomplish complex tasks
+ which cannot be done via a single map-reduce job. This is fairly easy since
+ the output of the job, typically, goes to distributed file-system and that
+ can be used as the input for the next job.
+
+
However, this also means that the onus on ensuring jobs are complete
+ (success/failure) lies squarely on the clients. In such situations the
+ various job-control options are:
+
+
+ {@link #runJob(JobConf)} : submits the job and returns only after
+ the job has completed.
+
+
+ {@link #submitJob(JobConf)} : only submits the job, then poll the
+ returned handle to the {@link RunningJob} to query status and make
+ scheduling decisions.
+
+
+ {@link JobConf#setJobEndNotificationURI(String)} : setup a notification
+ on job-completion, thus avoiding polling.
+
+
+
+ @see JobConf
+ @see ClusterStatus
+ @see Tool
+ @see DistributedCache]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ If the parameter {@code loadDefaults} is false, the new instance
+ will not load resources from the default files.
+
+ @param loadDefaults specifies whether to load from the default files]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if framework should keep the intermediate files
+ for failed tasks, false otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the outputs of the maps are to be compressed,
+ false otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This comparator should be provided if the equivalence rules for keys
+ for sorting the intermediates are different from those for grouping keys
+ before each call to
+ {@link Reducer#reduce(Object, java.util.Iterator, OutputCollector, Reporter)}.
+
+
For key-value pairs (K1,V1) and (K2,V2), the values (V1, V2) are passed
+ in a single call to the reduce function if K1 and K2 compare as equal.
+
+
Since {@link #setOutputKeyComparatorClass(Class)} can be used to control
+ how keys are sorted, this can be used in conjunction to simulate
+ secondary sort on values.
+
+
Note: This is not a guarantee of the reduce sort being
+ stable in any sense. (In any case, with the order of available
+ map-outputs to the reduce being non-deterministic, it wouldn't make
+ that much sense.)
+
+ @param theClass the comparator class to be used for grouping keys.
+ It should implement RawComparator.
+ @see #setOutputKeyComparatorClass(Class)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ combiner class used to combine map-outputs
+ before being sent to the reducers. Typically the combiner is same as the
+ the {@link Reducer} for the job i.e. {@link #getReducerClass()}.
+
+ @return the user-defined combiner class used to combine map-outputs.]]>
+
+
+
+
+
+ combiner class used to combine map-outputs
+ before being sent to the reducers.
+
+
The combiner is a task-level aggregation operation which, in some cases,
+ helps to cut down the amount of data transferred from the {@link Mapper} to
+ the {@link Reducer}, leading to better performance.
+
+
Typically the combiner is same as the the Reducer for the
+ job i.e. {@link #setReducerClass(Class)}.
+
+ @param theClass the user-defined combiner class used to combine
+ map-outputs.]]>
+
+
+
+
+ true.
+
+ @return true if speculative execution be used for this job,
+ false otherwise.]]>
+
+
+
+
+
+ true if speculative execution
+ should be turned on, else false.]]>
+
+
+
+
+ true.
+
+ @return true if speculative execution be
+ used for this job for map tasks,
+ false otherwise.]]>
+
+
+
+
+
+ true if speculative execution
+ should be turned on for map tasks,
+ else false.]]>
+
+
+
+
+ true.
+
+ @return true if speculative execution be used
+ for reduce tasks for this job,
+ false otherwise.]]>
+
+
+
+
+
+ true if speculative execution
+ should be turned on for reduce tasks,
+ else false.]]>
+
+
+
+
+ 1.
+
+ @return the number of reduce tasks for this job.]]>
+
+
+
+
+
+ Note: This is only a hint to the framework. The actual
+ number of spawned map tasks depends on the number of {@link InputSplit}s
+ generated by the job's {@link InputFormat#getSplits(JobConf, int)}.
+
+ A custom {@link InputFormat} is typically used to accurately control
+ the number of map tasks for the job.
+
+
How many maps?
+
+
The number of maps is usually driven by the total size of the inputs
+ i.e. total number of blocks of the input files.
+
+
The right level of parallelism for maps seems to be around 10-100 maps
+ per-node, although it has been set up to 300 or so for very cpu-light map
+ tasks. Task setup takes awhile, so it is best if the maps take at least a
+ minute to execute.
+
+
The default behavior of file-based {@link InputFormat}s is to split the
+ input into logical {@link InputSplit}s based on the total size, in
+ bytes, of input files. However, the {@link FileSystem} blocksize of the
+ input files is treated as an upper bound for input splits. A lower bound
+ on the split size can be set via
+
+ mapred.min.split.size.
+
+
Thus, if you expect 10TB of input data and have a blocksize of 128MB,
+ you'll end up with 82,000 maps, unless {@link #setNumMapTasks(int)} is
+ used to set it even higher.
+
+ @param n the number of map tasks for this job.
+ @see InputFormat#getSplits(JobConf, int)
+ @see FileInputFormat
+ @see FileSystem#getDefaultBlockSize()
+ @see FileStatus#getBlockSize()]]>
+
+
+
+
+ 1.
+
+ @return the number of reduce tasks for this job.]]>
+
+
+
+
+
+ How many reduces?
+
+
With 0.95 all of the reduces can launch immediately and
+ start transfering map outputs as the maps finish. With 1.75
+ the faster nodes will finish their first round of reduces and launch a
+ second wave of reduces doing a much better job of load balancing.
+
+
Increasing the number of reduces increases the framework overhead, but
+ increases load balancing and lowers the cost of failures.
+
+
The scaling factors above are slightly less than whole numbers to
+ reserve a few reduce slots in the framework for speculative-tasks, failures
+ etc.
+
+
Reducer NONE
+
+
It is legal to set the number of reduce-tasks to zero.
+
+
In this case the output of the map-tasks directly go to distributed
+ file-system, to the path set by
+ {@link FileOutputFormat#setOutputPath(JobConf, Path)}. Also, the
+ framework doesn't sort the map-outputs before writing it out to HDFS.
+
+ @param n the number of reduce tasks for this job.]]>
+
+
+
+
+ mapred.map.max.attempts
+ property. If this property is not already set, the default is 4 attempts.
+
+ @return the max number of attempts per map task.]]>
+
+
+
+
+
+
+
+
+
+
+ mapred.reduce.max.attempts
+ property. If this property is not already set, the default is 4 attempts.
+
+ @return the max number of attempts per reduce task.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ noFailures, the
+ tasktracker is blacklisted for this job.
+
+ @param noFailures maximum no. of failures of a given job per tasktracker.]]>
+
+
+
+
+ blacklisted for this job.
+
+ @return the maximum no. of failures of a given job per tasktracker.]]>
+
+
+
+
+ failed.
+
+ Defaults to zero, i.e. any failed map-task results in
+ the job being declared as {@link JobStatus#FAILED}.
+
+ @return the maximum percentage of map tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+
+ failed.
+
+ @param percent the maximum percentage of map tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+ failed.
+
+ Defaults to zero, i.e. any failed reduce-task results
+ in the job being declared as {@link JobStatus#FAILED}.
+
+ @return the maximum percentage of reduce tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+
+ failed.
+
+ @param percent the maximum percentage of reduce tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The debug script can aid debugging of failed map tasks. The script is
+ given task's stdout, stderr, syslog, jobconf files as arguments.
+
+
The debug command, run on the node where the map failed, is:
+
+ $script $stdout $stderr $syslog $jobconf.
+
+
+
The script file is distributed through {@link DistributedCache}
+ APIs. The script needs to be symlinked.
+
+ @param mDbgScript the script name]]>
+
+
+
+
+
+
+
+
+
+
+ The debug script can aid debugging of failed reduce tasks. The script
+ is given task's stdout, stderr, syslog, jobconf files as arguments.
+
+
The debug command, run on the node where the map failed, is:
+
+ $script $stdout $stderr $syslog $jobconf.
+
+
+
The script file is distributed through {@link DistributedCache}
+ APIs. The script file needs to be symlinked
+
+ @param rDbgScript the script name]]>
+
+
+
+
+
+
+
+
+
+ null if it hasn't
+ been set.
+ @see #setJobEndNotificationURI(String)]]>
+
+
+
+
+
+ The uri can contain 2 special parameters: $jobId and
+ $jobStatus. Those, if present, are replaced by the job's
+ identifier and completion-status respectively.
+
+
This is typically used by application-writers to implement chaining of
+ Map-Reduce jobs in an asynchronous manner.
+
+ @param uri the job end notification uri
+ @see JobStatus
+ @see Job Completion and Chaining]]>
+
+
+
+
+
+ When a job starts, a shared directory is created at location
+
+ ${mapred.local.dir}/taskTracker/jobcache/$jobid/work/ .
+ This directory is exposed to the users through
+ job.local.dir .
+ So, the tasks can use this space
+ as scratch space and share files among them.
+ This value is available as System property also.
+
+ @return The localized job specific shared directory]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ JobConf is the primary interface for a user to describe a
+ map-reduce job to the Hadoop framework for execution. The framework tries to
+ faithfully execute the job as-is described by JobConf, however:
+
+
+ Some configuration parameters might have been marked as
+
+ final by administrators and hence cannot be altered.
+
+
+ While some job parameters are straight-forward to set
+ (e.g. {@link #setNumReduceTasks(int)}), some parameters interact subtly
+ rest of the framework and/or job-configuration and is relatively more
+ complex for the user to control finely (e.g. {@link #setNumMapTasks(int)}).
+
+
+
+
JobConf typically specifies the {@link Mapper}, combiner
+ (if any), {@link Partitioner}, {@link Reducer}, {@link InputFormat} and
+ {@link OutputFormat} implementations to be used etc.
+
+
Optionally JobConf is used to specify other advanced facets
+ of the job such as Comparators to be used, files to be put in
+ the {@link DistributedCache}, whether or not intermediate and/or job outputs
+ are to be compressed (and how), debugability via user-provided scripts
+ ( {@link #setMapDebugScript(String)}/{@link #setReduceDebugScript(String)}),
+ for doing post-processing on task logs, task's stdout, stderr, syslog.
+ and etc.
+
+
Here is an example on how to configure a job via JobConf:
+
+ // Create a new JobConf
+ JobConf job = new JobConf(new Configuration(), MyJob.class);
+
+ // Specify various job-specific parameters
+ job.setJobName("myjob");
+
+ FileInputFormat.setInputPaths(job, new Path("in"));
+ FileOutputFormat.setOutputPath(job, new Path("out"));
+
+ job.setMapperClass(MyJob.MyMapper.class);
+ job.setCombinerClass(MyJob.MyReducer.class);
+ job.setReducerClass(MyJob.MyReducer.class);
+
+ job.setInputFormat(SequenceFileInputFormat.class);
+ job.setOutputFormat(SequenceFileOutputFormat.class);
+
+ @param jtIdentifier jobTracker identifier, or null
+ @param jobId job number, or null
+ @return a regex pattern matching JobIDs]]>
+
+
+
+
+ An example JobID is :
+ job_200707121733_0003 , which represents the third job
+ running at the jobtracker started at 200707121733.
+
+ Applications should never construct or parse JobID strings, but rather
+ use appropriate constructors or {@link #forName(String)} method.
+
+ @see TaskID
+ @see TaskAttemptID
+ @see JobTracker#getNewJobId()
+ @see JobTracker#getStartTime()]]>
+
Applications can use the {@link Reporter} provided to report progress
+ or just indicate that they are alive. In scenarios where the application
+ takes an insignificant amount of time to process individual key/value
+ pairs, this is crucial since the framework might assume that the task has
+ timed-out and kill that task. The other way of avoiding this is to set
+
+ mapred.task.timeout to a high-enough value (or even zero for no
+ time-outs).
+
+ @param key the input key.
+ @param value the input value.
+ @param output collects mapped keys and values.
+ @param reporter facility to report progress.]]>
+
+
+
+ Maps are the individual tasks which transform input records into a
+ intermediate records. The transformed intermediate records need not be of
+ the same type as the input records. A given input pair may map to zero or
+ many output pairs.
+
+
The Hadoop Map-Reduce framework spawns one map task for each
+ {@link InputSplit} generated by the {@link InputFormat} for the job.
+ Mapper implementations can access the {@link JobConf} for the
+ job via the {@link JobConfigurable#configure(JobConf)} and initialize
+ themselves. Similarly they can use the {@link Closeable#close()} method for
+ de-initialization.
+
+
The framework then calls
+ {@link #map(Object, Object, OutputCollector, Reporter)}
+ for each key/value pair in the InputSplit for that task.
+
+
All intermediate values associated with a given output key are
+ subsequently grouped by the framework, and passed to a {@link Reducer} to
+ determine the final output. Users can control the grouping by specifying
+ a Comparator via
+ {@link JobConf#setOutputKeyComparatorClass(Class)}.
+
+
The grouped Mapper outputs are partitioned per
+ Reducer. Users can control which keys (and hence records) go to
+ which Reducer by implementing a custom {@link Partitioner}.
+
+
Users can optionally specify a combiner, via
+ {@link JobConf#setCombinerClass(Class)}, to perform local aggregation of the
+ intermediate outputs, which helps to cut down the amount of data transferred
+ from the Mapper to the Reducer.
+
+
The intermediate, grouped outputs are always stored in
+ {@link SequenceFile}s. Applications can specify if and how the intermediate
+ outputs are to be compressed and which {@link CompressionCodec}s are to be
+ used via the JobConf.
+
+
If the job has
+ zero
+ reduces then the output of the Mapper is directly written
+ to the {@link FileSystem} without grouping by keys.
+
+
Example:
+
+ public class MyMapper<K extends WritableComparable, V extends Writable>
+ extends MapReduceBase implements Mapper<K, V, K, V> {
+
+ static enum MyCounters { NUM_RECORDS }
+
+ private String mapTaskId;
+ private String inputFile;
+ private int noRecords = 0;
+
+ public void configure(JobConf job) {
+ mapTaskId = job.get("mapred.task.id");
+ inputFile = job.get("mapred.input.file");
+ }
+
+ public void map(K key, V val,
+ OutputCollector<K, V> output, Reporter reporter)
+ throws IOException {
+ // Process the <key, value> pair (assume this takes a while)
+ // ...
+ // ...
+
+ // Let the framework know that we are alive, and kicking!
+ // reporter.progress();
+
+ // Process some more
+ // ...
+ // ...
+
+ // Increment the no. of <key, value> pairs processed
+ ++noRecords;
+
+ // Increment counters
+ reporter.incrCounter(NUM_RECORDS, 1);
+
+ // Every 100 records update application-level status
+ if ((noRecords%100) == 0) {
+ reporter.setStatus(mapTaskId + " processed " + noRecords +
+ " from input-file: " + inputFile);
+ }
+
+ // Output the result
+ output.collect(key, val);
+ }
+ }
+
+
+
Applications may write a custom {@link MapRunnable} to exert greater
+ control on map processing e.g. multi-threaded Mappers etc.
Mapping of input records to output records is complete when this method
+ returns.
+
+ @param input the {@link RecordReader} to read the input records.
+ @param output the {@link OutputCollector} to collect the outputrecords.
+ @param reporter {@link Reporter} to report progress, status-updates etc.
+ @throws IOException]]>
+
+
+
+ Custom implementations of MapRunnable can exert greater
+ control on map processing e.g. multi-threaded, asynchronous mappers etc.
+
+ @see Mapper]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ nearly
+ equal content length.
+ Subclasses implement {@link #getRecordReader(InputSplit, JobConf, Reporter)}
+ to construct RecordReader's for MultiFileSplit's.
+ @see MultiFileSplit]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ th Path]]>
+
+
+
+
+
+
+
+
+
+
+ th Path]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ MultiFileSplit can be used to implement {@link RecordReader}'s, with
+ reading one record per file.
+ @see FileSplit
+ @see MultiFileInputFormat]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ <key, value> pairs output by {@link Mapper}s
+ and {@link Reducer}s.
+
+
OutputCollector is the generalization of the facility
+ provided by the Map-Reduce framework to collect data output by either the
+ Mapper or the Reducer i.e. intermediate outputs
+ or the output of the job.
The Map-Reduce framework relies on the OutputCommitter of
+ the job to:
+
+
+ Setup the job during initialization. For example, create the temporary
+ output directory for the job during the initialization of the job.
+
+
+ Cleanup the job after the job completion. For example, remove the
+ temporary output directory after the job completion.
+
+
+ Setup the task temporary output.
+
+
+ Check whether a task needs a commit. This is to avoid the commit
+ procedure if a task does not need commit.
+
+
+ Commit of the task output.
+
+
+ Discard the task commit.
+
+
+
+ @see FileOutputCommitter
+ @see JobContext
+ @see TaskAttemptContext]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This is to validate the output specification for the job when it is
+ a job is submitted. Typically checks that it does not already exist,
+ throwing an exception when it already exists, so that output is not
+ overwritten.
+
+ @param ignored
+ @param job job configuration.
+ @throws IOException when output should not be attempted]]>
+
+
+
+ OutputFormat describes the output-specification for a
+ Map-Reduce job.
+
+
The Map-Reduce framework relies on the OutputFormat of the
+ job to:
+
+
+ Validate the output-specification of the job. For e.g. check that the
+ output directory doesn't already exist.
+
+ Provide the {@link RecordWriter} implementation to be used to write out
+ the output files of the job. Output files are stored in a
+ {@link FileSystem}.
+
+
+
+ @see RecordWriter
+ @see JobConf]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Typically a hash function on a all or a subset of the key.
+
+ @param key the key to be paritioned.
+ @param value the entry value.
+ @param numPartitions the total number of partitions.
+ @return the partition number for the key.]]>
+
+
+
+ Partitioner controls the partitioning of the keys of the
+ intermediate map-outputs. The key (or a subset of the key) is used to derive
+ the partition, typically by a hash function. The total number of partitions
+ is the same as the number of reduce tasks for the job. Hence this controls
+ which of the m reduce tasks the intermediate key (and hence the
+ record) is sent for reduction.
+
+ @see Reducer]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ 0.0 to 1.0.
+ @throws IOException]]>
+
+
+
+ RecordReader reads <key, value> pairs from an
+ {@link InputSplit}.
+
+
RecordReader, typically, converts the byte-oriented view of
+ the input, provided by the InputSplit, and presents a
+ record-oriented view for the {@link Mapper} & {@link Reducer} tasks for
+ processing. It thus assumes the responsibility of processing record
+ boundaries and presenting the tasks with keys and values.
RecordWriter implementations write the job outputs to the
+ {@link FileSystem}.
+
+ @see OutputFormat]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Reduces values for a given key.
+
+
The framework calls this method for each
+ <key, (list of values)> pair in the grouped inputs.
+ Output values must be of the same type as input values. Input keys must
+ not be altered. The framework will reuse the key and value objects
+ that are passed into the reduce, therefore the application should clone
+ the objects they want to keep a copy of. In many cases, all values are
+ combined into zero or one value.
+
+
+
Output pairs are collected with calls to
+ {@link OutputCollector#collect(Object,Object)}.
+
+
Applications can use the {@link Reporter} provided to report progress
+ or just indicate that they are alive. In scenarios where the application
+ takes an insignificant amount of time to process individual key/value
+ pairs, this is crucial since the framework might assume that the task has
+ timed-out and kill that task. The other way of avoiding this is to set
+
+ mapred.task.timeout to a high-enough value (or even zero for no
+ time-outs).
+
+ @param key the key.
+ @param values the list of values to reduce.
+ @param output to collect keys and combined values.
+ @param reporter facility to report progress.]]>
+
+
+
+ The number of Reducers for the job is set by the user via
+ {@link JobConf#setNumReduceTasks(int)}. Reducer implementations
+ can access the {@link JobConf} for the job via the
+ {@link JobConfigurable#configure(JobConf)} method and initialize themselves.
+ Similarly they can use the {@link Closeable#close()} method for
+ de-initialization.
+
+
Reducer has 3 primary phases:
+
+
+
+
Shuffle
+
+
Reducer is input the grouped output of a {@link Mapper}.
+ In the phase the framework, for each Reducer, fetches the
+ relevant partition of the output of all the Mappers, via HTTP.
+
+
+
+
+
Sort
+
+
The framework groups Reducer inputs by keys
+ (since different Mappers may have output the same key) in this
+ stage.
+
+
The shuffle and sort phases occur simultaneously i.e. while outputs are
+ being fetched they are merged.
+
+
SecondarySort
+
+
If equivalence rules for keys while grouping the intermediates are
+ different from those for grouping keys before reduction, then one may
+ specify a Comparator via
+ {@link JobConf#setOutputValueGroupingComparator(Class)}.Since
+ {@link JobConf#setOutputKeyComparatorClass(Class)} can be used to
+ control how intermediate keys are grouped, these can be used in conjunction
+ to simulate secondary sort on values.
+
+
+ For example, say that you want to find duplicate web pages and tag them
+ all with the url of the "best" known example. You would set up the job
+ like:
+
+
Map Input Key: url
+
Map Input Value: document
+
Map Output Key: document checksum, url pagerank
+
Map Output Value: url
+
Partitioner: by checksum
+
OutputKeyComparator: by checksum and then decreasing pagerank
+
OutputValueGroupingComparator: by checksum
+
+
+
+
+
Reduce
+
+
In this phase the
+ {@link #reduce(Object, Iterator, OutputCollector, Reporter)}
+ method is called for each <key, (list of values)> pair in
+ the grouped inputs.
+
The output of the reduce task is typically written to the
+ {@link FileSystem} via
+ {@link OutputCollector#collect(Object, Object)}.
+
+
+
+
The output of the Reducer is not re-sorted.
+
+
Example:
+
+ public class MyReducer<K extends WritableComparable, V extends Writable>
+ extends MapReduceBase implements Reducer<K, V, K, V> {
+
+ static enum MyCounters { NUM_RECORDS }
+
+ private String reduceTaskId;
+ private int noKeys = 0;
+
+ public void configure(JobConf job) {
+ reduceTaskId = job.get("mapred.task.id");
+ }
+
+ public void reduce(K key, Iterator<V> values,
+ OutputCollector<K, V> output,
+ Reporter reporter)
+ throws IOException {
+
+ // Process
+ int noValues = 0;
+ while (values.hasNext()) {
+ V value = values.next();
+
+ // Increment the no. of values for this key
+ ++noValues;
+
+ // Process the <key, value> pair (assume this takes a while)
+ // ...
+ // ...
+
+ // Let the framework know that we are alive, and kicking!
+ if ((noValues%10) == 0) {
+ reporter.progress();
+ }
+
+ // Process some more
+ // ...
+ // ...
+
+ // Output the <key, value>
+ output.collect(key, value);
+ }
+
+ // Increment the no. of <key, list of values> pairs processed
+ ++noKeys;
+
+ // Increment counters
+ reporter.incrCounter(NUM_RECORDS, 1);
+
+ // Every 100 keys update application-level status
+ if ((noKeys%100) == 0) {
+ reporter.setStatus(reduceTaskId + " processed " + noKeys);
+ }
+ }
+ }
+
+
+ @see Mapper
+ @see Partitioner
+ @see Reporter
+ @see MapReduceBase]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Counter of the given group/name.]]>
+
+
+
+
+
+
+ Enum.
+ @param amount A non-negative amount by which the counter is to
+ be incremented.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ InputSplit that the map is reading from.
+ @throws UnsupportedOperationException if called outside a mapper]]>
+
+
+
+
+
+
+
+
+ {@link Mapper} and {@link Reducer} can use the Reporter
+ provided to report progress or just indicate that they are alive. In
+ scenarios where the application takes an insignificant amount of time to
+ process individual key/value pairs, this is crucial since the framework
+ might assume that the task has timed-out and kill that task.
+
+
Applications can also update {@link Counters} via the provided
+ Reporter .
+
+ @see Progressable
+ @see Counters]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ progress of the job's map-tasks, as a float between 0.0
+ and 1.0. When all map tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's map-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ progress of the job's reduce-tasks, as a float between 0.0
+ and 1.0. When all reduce tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's reduce-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ progress of the job's cleanup-tasks, as a float between 0.0
+ and 1.0. When all cleanup tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's cleanup-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ progress of the job's setup-tasks, as a float between 0.0
+ and 1.0. When all setup tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's setup-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ true if the job is complete, else false.
+ @throws IOException]]>
+
+
+
+
+
+ true if the job succeeded, else false.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ RunningJob is the user-interface to query for details on a
+ running Map-Reduce job.
+
+
Clients can get hold of RunningJob via the {@link JobClient}
+ and then query the running-job for details such as name, configuration,
+ progress etc.
+
+ @see JobClient]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This allows the user to specify the key class to be different
+ from the actual class ({@link BytesWritable}) used for writing
+
+ @param conf the {@link JobConf} to modify
+ @param theClass the SequenceFile output key class.]]>
+
+
+
+
+
+
+ This allows the user to specify the value class to be different
+ from the actual class ({@link BytesWritable}) used for writing
+
+ @param conf the {@link JobConf} to modify
+ @param theClass the SequenceFile output key class.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ f. The filtering criteria is
+ MD5(key) % f == 0.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ f using
+ the criteria record# % f == 0.
+ For example, if the frequency is 10, one out of 10 records is returned.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if auto increment
+ {@link SkipBadRecords#COUNTER_MAP_PROCESSED_RECORDS}.
+ false otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ true if auto increment
+ {@link SkipBadRecords#COUNTER_REDUCE_PROCESSED_GROUPS}.
+ false otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Hadoop provides an optional mode of execution in which the bad records
+ are detected and skipped in further attempts.
+
+
This feature can be used when map/reduce tasks crashes deterministically on
+ certain input. This happens due to bugs in the map/reduce function. The usual
+ course would be to fix these bugs. But sometimes this is not possible;
+ perhaps the bug is in third party libraries for which the source code is
+ not available. Due to this, the task never reaches to completion even with
+ multiple attempts and complete data for that task is lost.
+
+
With this feature, only a small portion of data is lost surrounding
+ the bad record, which may be acceptable for some user applications.
+ see {@link SkipBadRecords#setMapperMaxSkipRecords(Configuration, long)}
+
+
The skipping mode gets kicked off after certain no of failures
+ see {@link SkipBadRecords#setAttemptsToStartSkipping(Configuration, int)}
+
+
In the skipping mode, the map/reduce task maintains the record range which
+ is getting processed at all times. Before giving the input to the
+ map/reduce function, it sends this record range to the Task tracker.
+ If task crashes, the Task tracker knows which one was the last reported
+ range. On further attempts that range get skipped.
+ @param jtIdentifier jobTracker identifier, or null
+ @param jobId job number, or null
+ @param isMap whether the tip is a map, or null
+ @param taskId taskId number, or null
+ @param attemptId the task attempt number, or null
+ @return a regex pattern matching TaskAttemptIDs]]>
+
+
+
+
+ An example TaskAttemptID is :
+ attempt_200707121733_0003_m_000005_0 , which represents the
+ zeroth task attempt for the fifth map task in the third job
+ running at the jobtracker started at 200707121733.
+
+ Applications should never construct or parse TaskAttemptID strings
+ , but rather use appropriate constructors or {@link #forName(String)}
+ method.
+
+ @see JobID
+ @see TaskID]]>
+
+ @param jtIdentifier jobTracker identifier, or null
+ @param jobId job number, or null
+ @param isMap whether the tip is a map, or null
+ @param taskId taskId number, or null
+ @return a regex pattern matching TaskIDs]]>
+
+
+
+
+ An example TaskID is :
+ task_200707121733_0003_m_000005 , which represents the
+ fifth map task in the third job running at the jobtracker
+ started at 200707121733.
+
+ Applications should never construct or parse TaskID strings
+ , but rather use appropriate constructors or {@link #forName(String)}
+ method.
+
+ @see JobID
+ @see TaskAttemptID]]>
+
+
+
+
+
+
+
+ (tbl(,),tbl(,),...,tbl(,)) }]]>
+
+
+
+
+
+
+
+ (tbl(,),tbl(,),...,tbl(,)) }]]>
+
+
+
+ mapred.join.define.<ident> to a classname. In the expression
+ mapred.join.expr, the identifier will be assumed to be a
+ ComposableRecordReader.
+ mapred.join.keycomparator can be a classname used to compare keys
+ in the join.
+ @see JoinRecordReader
+ @see MultiFilterRecordReader]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ......
+ }]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ capacity children to position
+ id in the parent reader.
+ The id of a root CompositeRecordReader is -1 by convention, but relying
+ on this is not recommended.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ override(S1,S2,S3) will prefer values
+ from S3 over S2, and values from S2 over S1 for all keys
+ emitted from all sources.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ [,,...,]]]>
+
+
+
+
+
+
+ out.
+ TupleWritable format:
+ {@code
+ ......
+ }]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ It has to be specified how key and values are passed from one element of
+ the chain to the next, by value or by reference. If a Mapper leverages the
+ assumed semantics that the key and values are not modified by the collector
+ 'by value' must be used. If the Mapper does not expect this semantics, as
+ an optimization to avoid serialization and deserialization 'by reference'
+ can be used.
+
+ For the added Mapper the configuration given for it,
+ mapperConf, have precedence over the job's JobConf. This
+ precedence is in effect when the task is running.
+
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainMapper, this is done by the addMapper for the last mapper in the chain
+
+
+ @param job job's JobConf to add the Mapper class.
+ @param klass the Mapper class to add.
+ @param inputKeyClass mapper input key class.
+ @param inputValueClass mapper input value class.
+ @param outputKeyClass mapper output key class.
+ @param outputValueClass mapper output value class.
+ @param byValue indicates if key/values should be passed by value
+ to the next Mapper in the chain, if any.
+ @param mapperConf a JobConf with the configuration for the Mapper
+ class. It is recommended to use a JobConf without default values using the
+ JobConf(boolean loadDefaults) constructor with FALSE.]]>
+
+
+
+
+
+
+ If this method is overriden super.configure(...) should be
+ invoked at the beginning of the overwriter method.]]>
+
+
+
+
+
+
+
+
+
+ map(...) methods of the Mappers in the chain.]]>
+
+
+
+
+
+
+ If this method is overriden super.close() should be
+ invoked at the end of the overwriter method.]]>
+
+
+
+
+ The Mapper classes are invoked in a chained (or piped) fashion, the output of
+ the first becomes the input of the second, and so on until the last Mapper,
+ the output of the last Mapper will be written to the task's output.
+
+ The key functionality of this feature is that the Mappers in the chain do not
+ need to be aware that they are executed in a chain. This enables having
+ reusable specialized Mappers that can be combined to perform composite
+ operations within a single task.
+
+ Special care has to be taken when creating chains that the key/values output
+ by a Mapper are valid for the following Mapper in the chain. It is assumed
+ all Mappers and the Reduce in the chain use maching output and input key and
+ value classes as no conversion is done by the chaining code.
+
+ Using the ChainMapper and the ChainReducer classes is possible to compose
+ Map/Reduce jobs that look like [MAP+ / REDUCE MAP*]. And
+ immediate benefit of this pattern is a dramatic reduction in disk IO.
+
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainMapper, this is done by the addMapper for the last mapper in the chain.
+
+ ChainMapper usage pattern:
+
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ It has to be specified how key and values are passed from one element of
+ the chain to the next, by value or by reference. If a Reducer leverages the
+ assumed semantics that the key and values are not modified by the collector
+ 'by value' must be used. If the Reducer does not expect this semantics, as
+ an optimization to avoid serialization and deserialization 'by reference'
+ can be used.
+
+ For the added Reducer the configuration given for it,
+ reducerConf, have precedence over the job's JobConf. This
+ precedence is in effect when the task is running.
+
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainReducer, this is done by the setReducer or the addMapper for the last
+ element in the chain.
+
+ @param job job's JobConf to add the Reducer class.
+ @param klass the Reducer class to add.
+ @param inputKeyClass reducer input key class.
+ @param inputValueClass reducer input value class.
+ @param outputKeyClass reducer output key class.
+ @param outputValueClass reducer output value class.
+ @param byValue indicates if key/values should be passed by value
+ to the next Mapper in the chain, if any.
+ @param reducerConf a JobConf with the configuration for the Reducer
+ class. It is recommended to use a JobConf without default values using the
+ JobConf(boolean loadDefaults) constructor with FALSE.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ It has to be specified how key and values are passed from one element of
+ the chain to the next, by value or by reference. If a Mapper leverages the
+ assumed semantics that the key and values are not modified by the collector
+ 'by value' must be used. If the Mapper does not expect this semantics, as
+ an optimization to avoid serialization and deserialization 'by reference'
+ can be used.
+
+ For the added Mapper the configuration given for it,
+ mapperConf, have precedence over the job's JobConf. This
+ precedence is in effect when the task is running.
+
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainMapper, this is done by the addMapper for the last mapper in the chain
+ .
+
+ @param job chain job's JobConf to add the Mapper class.
+ @param klass the Mapper class to add.
+ @param inputKeyClass mapper input key class.
+ @param inputValueClass mapper input value class.
+ @param outputKeyClass mapper output key class.
+ @param outputValueClass mapper output value class.
+ @param byValue indicates if key/values should be passed by value
+ to the next Mapper in the chain, if any.
+ @param mapperConf a JobConf with the configuration for the Mapper
+ class. It is recommended to use a JobConf without default values using the
+ JobConf(boolean loadDefaults) constructor with FALSE.]]>
+
+
+
+
+
+
+ If this method is overriden super.configure(...) should be
+ invoked at the beginning of the overwriter method.]]>
+
+
+
+
+
+
+
+
+
+ reduce(...) method of the Reducer with the
+ map(...) methods of the Mappers in the chain.]]>
+
+
+
+
+
+
+ If this method is overriden super.close() should be
+ invoked at the end of the overwriter method.]]>
+
+
+
+
+ For each record output by the Reducer, the Mapper classes are invoked in a
+ chained (or piped) fashion, the output of the first becomes the input of the
+ second, and so on until the last Mapper, the output of the last Mapper will
+ be written to the task's output.
+
+ The key functionality of this feature is that the Mappers in the chain do not
+ need to be aware that they are executed after the Reducer or in a chain.
+ This enables having reusable specialized Mappers that can be combined to
+ perform composite operations within a single task.
+
+ Special care has to be taken when creating chains that the key/values output
+ by a Mapper are valid for the following Mapper in the chain. It is assumed
+ all Mappers and the Reduce in the chain use maching output and input key and
+ value classes as no conversion is done by the chaining code.
+
+ Using the ChainMapper and the ChainReducer classes is possible to compose
+ Map/Reduce jobs that look like [MAP+ / REDUCE MAP*]. And
+ immediate benefit of this pattern is a dramatic reduction in disk IO.
+
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainReducer, this is done by the setReducer or the addMapper for the last
+ element in the chain.
+
+ ChainReducer usage pattern:
+
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ all splits.
+ @param freq The frequency with which records will be emitted.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ all splits.
+ This will read every split at the client, which is very expensive.
+ @param freq Probability with which a key will be chosen.
+ @param numSamples Total number of samples to obtain from all selected
+ splits.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ all splits.
+ Takes the first numSamples / numSplits records from each split.
+ @param numSamples Total number of samples to obtain from all selected
+ splits.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the name output is multi, false
+ if it is single. If the name output is not defined it returns
+ false]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ @param conf job conf to add the named output
+ @param namedOutput named output name, it has to be a word, letters
+ and numbers only, cannot be the word 'part' as
+ that is reserved for the
+ default output.
+ @param outputFormatClass OutputFormat class.
+ @param keyClass key class
+ @param valueClass value class]]>
+
+
+
+
+
+
+
+
+
+
+
+ @param conf job conf to add the named output
+ @param namedOutput named output name, it has to be a word, letters
+ and numbers only, cannot be the word 'part' as
+ that is reserved for the
+ default output.
+ @param outputFormatClass OutputFormat class.
+ @param keyClass key class
+ @param valueClass value class]]>
+
+
+
+
+
+
+
+ By default these counters are disabled.
+
+ MultipleOutputs supports counters, by default the are disabled.
+ The counters group is the {@link MultipleOutputs} class name.
+
+ The names of the counters are the same as the named outputs. For multi
+ named outputs the name of the counter is the concatenation of the named
+ output, and underscore '_' and the multiname.
+
+ @param conf job conf to enableadd the named output.
+ @param enabled indicates if the counters will be enabled or not.]]>
+
+
+
+
+
+
+ By default these counters are disabled.
+
+ MultipleOutputs supports counters, by default the are disabled.
+ The counters group is the {@link MultipleOutputs} class name.
+
+ The names of the counters are the same as the named outputs. For multi
+ named outputs the name of the counter is the concatenation of the named
+ output, and underscore '_' and the multiname.
+
+
+ @param conf job conf to enableadd the named output.
+ @return TRUE if the counters are enabled, FALSE if they are disabled.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ @param namedOutput the named output name
+ @param reporter the reporter
+ @return the output collector for the given named output
+ @throws IOException thrown if output collector could not be created]]>
+
+
+
+
+
+
+
+
+
+
+ @param namedOutput the named output name
+ @param multiName the multi name part
+ @param reporter the reporter
+ @return the output collector for the given named output
+ @throws IOException thrown if output collector could not be created]]>
+
+
+
+
+
+
+ If overriden subclasses must invoke super.close() at the
+ end of their close()
+
+ @throws java.io.IOException thrown if any of the MultipleOutput files
+ could not be closed properly.]]>
+
+
+
+ OutputCollector passed to
+ the map() and reduce() methods of the
+ Mapper and Reducer implementations.
+
+ Each additional output, or named output, may be configured with its own
+ OutputFormat, with its own key class and with its own value
+ class.
+
+ A named output can be a single file or a multi file. The later is refered as
+ a multi named output.
+
+ A multi named output is an unbound set of files all sharing the same
+ OutputFormat, key class and value class configuration.
+
+ When named outputs are used within a Mapper implementation,
+ key/values written to a name output are not part of the reduce phase, only
+ key/values written to the job OutputCollector are part of the
+ reduce phase.
+
+ MultipleOutputs supports counters, by default the are disabled. The counters
+ group is the {@link MultipleOutputs} class name.
+
+ The names of the counters are the same as the named outputs. For multi
+ named outputs the name of the counter is the concatenation of the named
+ output, and underscore '_' and the multiname.
+
+ Job configuration usage pattern is:
+
+
+ JobConf conf = new JobConf();
+
+ conf.setInputPath(inDir);
+ FileOutputFormat.setOutputPath(conf, outDir);
+
+ conf.setMapperClass(MOMap.class);
+ conf.setReducerClass(MOReduce.class);
+ ...
+
+ // Defines additional single text based output 'text' for the job
+ MultipleOutputs.addNamedOutput(conf, "text", TextOutputFormat.class,
+ LongWritable.class, Text.class);
+
+ // Defines additional multi sequencefile based output 'sequence' for the
+ // job
+ MultipleOutputs.addMultiNamedOutput(conf, "seq",
+ SequenceFileOutputFormat.class,
+ LongWritable.class, Text.class);
+ ...
+
+ JobClient jc = new JobClient();
+ RunningJob job = jc.submitJob(conf);
+
+ ...
+
+
+ Job configuration usage pattern is:
+
+
+ public class MOReduce implements
+ Reducer<WritableComparable, Writable> {
+ private MultipleOutputs mos;
+
+ public void configure(JobConf conf) {
+ ...
+ mos = new MultipleOutputs(conf);
+ }
+
+ public void reduce(WritableComparable key, Iterator<Writable> values,
+ OutputCollector output, Reporter reporter)
+ throws IOException {
+ ...
+ mos.getCollector("text", reporter).collect(key, new Text("Hello"));
+ mos.getCollector("seq", "A", reporter).collect(key, new Text("Bye"));
+ mos.getCollector("seq", "B", reporter).collect(key, new Text("Chau"));
+ ...
+ }
+
+ public void close() throws IOException {
+ mos.close();
+ ...
+ }
+
+ }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ It can be used instead of the default implementation,
+ @link org.apache.hadoop.mapred.MapRunner, when the Map operation is not CPU
+ bound in order to improve throughput.
+
+ Map implementations using this MapRunnable must be thread-safe.
+
+ The Map-Reduce job has to be configured to use this MapRunnable class (using
+ the JobConf.setMapRunnerClass method) and
+ the number of thread the thread-pool can use with the
+ mapred.map.multithreadedrunner.threads property, its default
+ value is 10 threads.
+
+ Alternatively, the properties can be set in the configuration with proper
+ values.
+
+ @see DBConfiguration#configureDB(JobConf, String, String, String, String)
+ @see DBInputFormat#setInput(JobConf, Class, String, String)
+ @see DBInputFormat#setInput(JobConf, Class, String, String, String, String...)
+ @see DBOutputFormat#setOutput(JobConf, String, String...)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ 20070101 AND length > 0)'
+ @param orderBy the fieldNames in the orderBy clause.
+ @param fieldNames The field names in the table
+ @see #setInput(JobConf, Class, String, String)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ DBInputFormat emits LongWritables containing the record number as
+ key and DBWritables as value.
+
+ The SQL query, and input class can be using one of the two
+ setInput methods.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ {@link DBOutputFormat} accepts <key,value> pairs, where
+ key has a type extending DBWritable. Returned {@link RecordWriter}
+ writes only the key to the database with a batch SQL query.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ DBWritable. DBWritable, is similar to {@link Writable}
+ except that the {@link #write(PreparedStatement)} method takes a
+ {@link PreparedStatement}, and {@link #readFields(ResultSet)}
+ takes a {@link ResultSet}.
+
+ Implementations are responsible for writing the fields of the object
+ to PreparedStatement, and reading the fields of the object from the
+ ResultSet.
+
+
Example:
+ If we have the following table in the database :
+
+ CREATE TABLE MyTable (
+ counter INTEGER NOT NULL,
+ timestamp BIGINT NOT NULL,
+ );
+
+ then we can read/write the tuples from/to the table with :
+
+ public class MyWritable implements Writable, DBWritable {
+ // Some data
+ private int counter;
+ private long timestamp;
+
+ //Writable#write() implementation
+ public void write(DataOutput out) throws IOException {
+ out.writeInt(counter);
+ out.writeLong(timestamp);
+ }
+
+ //Writable#readFields() implementation
+ public void readFields(DataInput in) throws IOException {
+ counter = in.readInt();
+ timestamp = in.readLong();
+ }
+
+ public void write(PreparedStatement statement) throws SQLException {
+ statement.setInt(1, counter);
+ statement.setLong(2, timestamp);
+ }
+
+ public void readFields(ResultSet resultSet) throws SQLException {
+ counter = resultSet.getInt(1);
+ timestamp = resultSet.getLong(2);
+ }
+ }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ When constructing the instance, if the factory property
+ contextName.class exists,
+ its value is taken to be the name of the class to instantiate. Otherwise,
+ the default is to create an instance of
+ org.apache.hadoop.metrics.spi.NullContext, which is a
+ dummy "no-op" context which will cause all metric data to be discarded.
+
+ @param contextName the name of the context
+ @return the named MetricsContext]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ When the instance is constructed, this method checks if the file
+ hadoop-metrics.properties exists on the class path. If it
+ exists, it must be in the format defined by java.util.Properties, and all
+ the properties in the file are set as attributes on the newly created
+ ContextFactory instance.
+
+ @return the singleton ContextFactory instance]]>
+
+
+
+ getFactory() method.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ startMonitoring() again after calling
+ this.
+ @see #close()]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ recordName.
+ Throws an exception if the metrics implementation is configured with a fixed
+ set of record names and recordName is not in that set.
+
+ @param recordName the name of the record
+ @throws MetricsException if recordName conflicts with configuration data]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A record name identifies the kind of data to be reported. For example, a
+ program reporting statistics relating to the disks on a computer might use
+ a record name "diskStats".
+
+ A record has zero or more tags. A tag has a name and a value. To
+ continue the example, the "diskStats" record might use a tag named
+ "diskName" to identify a particular disk. Sometimes it is useful to have
+ more than one tag, so there might also be a "diskType" with value "ide" or
+ "scsi" or whatever.
+
+ A record also has zero or more metrics. These are the named
+ values that are to be reported to the metrics system. In the "diskStats"
+ example, possible metric names would be "diskPercentFull", "diskPercentBusy",
+ "kbReadPerSecond", etc.
+
+ The general procedure for using a MetricsRecord is to fill in its tag and
+ metric values, and then call update() to pass the record to the
+ client library.
+ Metric data is not immediately sent to the metrics system
+ each time that update() is called.
+ An internal table is maintained, identified by the record name. This
+ table has columns
+ corresponding to the tag and the metric names, and rows
+ corresponding to each unique set of tag values. An update
+ either modifies an existing row in the table, or adds a new row with a set of
+ tag values that are different from all the other rows. Note that if there
+ are no tags, then there can be at most one row in the table.
+
+ Once a row is added to the table, its data will be sent to the metrics system
+ on every timer period, whether or not it has been updated since the previous
+ timer period. If this is inappropriate, for example if metrics were being
+ reported by some transient object in an application, the remove()
+ method can be used to remove the row and thus stop the data from being
+ sent.
+
+ Note that the update() method is atomic. This means that it is
+ safe for different threads to be updating the same metric. More precisely,
+ it is OK for different threads to call update() on MetricsRecord instances
+ with the same set of tag names and tag values. Different threads should
+ not use the same MetricsRecord instance at the same time.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ MetricsContext.registerUpdater().]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ fileName attribute,
+ if specified. Otherwise the data will be written to standard
+ output.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class is configured by setting ContextFactory attributes which in turn
+ are usually configured through a properties file. All the attributes are
+ prefixed by the contextName. For example, the properties file might contain:
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ contextName.tableName. The returned map consists of
+ those attributes with the contextName and tableName stripped off.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ recordName.
+ Throws an exception if the metrics implementation is configured with a fixed
+ set of record names and recordName is not in that set.
+
+ @param recordName the name of the record
+ @throws MetricsException if recordName conflicts with configuration data]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class implements the internal table of metric data, and the timer
+ on which data is to be sent to the metrics system. Subclasses must
+ override the abstract emitRecord method in order to transmit
+ the data. ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ update
+ and remove().]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ hostname or hostname:port. If
+ the specs string is null, defaults to localhost:defaultPort.
+
+ @return a list of InetSocketAddress objects.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ,name="
+ Where the and are the supplied parameters
+
+ @param serviceName
+ @param nameName
+ @param theMbean - the MBean to register
+ @return the named used to register the MBean]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ hadoop.rpc.socket.factory.class.<ClassName>. When no
+ such parameter exists then fall back on the default socket factory as
+ configured by hadoop.rpc.socket.factory.class.default. If
+ this default socket factory is not configured, then fall back on the JVM
+ default socket factory.
+
+ @param conf the configuration
+ @param clazz the class (usually a {@link VersionedProtocol})
+ @return a socket factory]]>
+
+
+
+
+
+ hadoop.rpc.socket.factory.default
+
+ @param conf the configuration
+ @return the default socket factory as specified in the configuration or
+ the JVM default socket factory if the configuration does not
+ contain a default socket factory property.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ :
+ ://:/]]>
+
+
+
+
+
+
+
+ :
+ ://:/]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ From documentation for {@link #getInputStream(Socket, long)}:
+ Returns InputStream for the socket. If the socket has an associated
+ SocketChannel then it returns a
+ {@link SocketInputStream} with the given timeout. If the socket does not
+ have a channel, {@link Socket#getInputStream()} is returned. In the later
+ case, the timeout argument is ignored and the timeout set with
+ {@link Socket#setSoTimeout(int)} applies for reads.
+
+ Any socket created using socket factories returned by {@link #NetUtils},
+ must use this interface instead of {@link Socket#getInputStream()}.
+
+ @see #getInputStream(Socket, long)
+
+ @param socket
+ @return InputStream for reading from the socket.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+ Any socket created using socket factories returned by {@link #NetUtils},
+ must use this interface instead of {@link Socket#getInputStream()}.
+
+ @see Socket#getChannel()
+
+ @param socket
+ @param timeout timeout in milliseconds. This may not always apply. zero
+ for waiting as long as necessary.
+ @return InputStream for reading from the socket.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+ From documentation for {@link #getOutputStream(Socket, long)} :
+ Returns OutputStream for the socket. If the socket has an associated
+ SocketChannel then it returns a
+ {@link SocketOutputStream} with the given timeout. If the socket does not
+ have a channel, {@link Socket#getOutputStream()} is returned. In the later
+ case, the timeout argument is ignored and the write will wait until
+ data is available.
The task requires the file or the nested fileset element to be
+ specified. Optional attributes are language (set the output
+ language, default is "java"),
+ destdir (name of the destination directory for generated java/c++
+ code, default is ".") and failonerror (specifies error handling
+ behavior. default is true).
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ in]]>
+
+
+
+
+
+
+ out.]]>
+
+
+
+
+
+
+
+
+
+ reset is true, then resets the checksum.
+ @return number of bytes written. Will be equal to getChecksumSize();]]>
+
+
+
+
+
+
+
+
+ reset is true, then resets the checksum.
+ @return number of bytes written. Will be equal to getChecksumSize();]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ GenericOptionsParser to parse only the generic Hadoop
+ arguments.
+
+ The array of string arguments other than the generic arguments can be
+ obtained by {@link #getRemainingArgs()}.
+
+ @param conf the Configuration to modify.
+ @param args command-line arguments.]]>
+
+
+
+
+ GenericOptionsParser to parse given options as well
+ as generic Hadoop options.
+
+ The resulting CommandLine object can be obtained by
+ {@link #getCommandLine()}.
+
+ @param conf the configuration to modify
+ @param options options built by the caller
+ @param args User-specified arguments]]>
+
+
+
+
+ Strings containing the un-parsed arguments
+ or empty array if commandLine was not defined.]]>
+
+
+
+
+ CommandLine object
+ to process the parsed arguments.
+
+ Note: If the object is created with
+ {@link #GenericOptionsParser(Configuration, String[])}, then returned
+ object will only contain parsed generic options.
+
+ @return CommandLine representing list of arguments
+ parsed against Options descriptor.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ GenericOptionsParser is a utility to parse command line
+ arguments generic to the Hadoop framework.
+
+ GenericOptionsParser recognizes several standarad command
+ line arguments, enabling applications to easily specify a namenode, a
+ jobtracker, additional configuration resources etc.
+
+
Generic Options
+
+
The supported generic options are:
+
+ -conf <configuration file> specify a configuration file
+ -D <property=value> use value for given property
+ -fs <local|namenode:port> specify a namenode
+ -jt <local|jobtracker:port> specify a job tracker
+ -files <comma separated list of files> specify comma separated
+ files to be copied to the map reduce cluster
+ -libjars <comma separated list of jars> specify comma separated
+ jar files to include in the classpath.
+ -archives <comma separated list of archives> specify comma
+ separated archives to be unarchived on the compute machines.
+
+
Generic command line arguments might modify
+ Configuration objects, given to constructors.
+
+
The functionality is implemented using Commons CLI.
+
+
Examples:
+
+ $ bin/hadoop dfs -fs darwin:8020 -ls /data
+ list /data directory in dfs with namenode darwin:8020
+
+ $ bin/hadoop dfs -D fs.default.name=darwin:8020 -ls /data
+ list /data directory in dfs with namenode darwin:8020
+
+ $ bin/hadoop dfs -conf hadoop-site.xml -ls /data
+ list /data directory in dfs with conf specified in hadoop-site.xml
+
+ $ bin/hadoop job -D mapred.job.tracker=darwin:50020 -submit job.xml
+ submit a job to job tracker darwin:50020
+
+ $ bin/hadoop job -jt darwin:50020 -submit job.xml
+ submit a job to job tracker darwin:50020
+
+ $ bin/hadoop job -jt local -submit job.xml
+ submit a job to local runner
+
+ $ bin/hadoop jar -libjars testlib.jar
+ -archives test.tgz -files file.txt inputjar args
+ job submission with libjars, files and archives
+
+
+ @see Tool
+ @see ToolRunner]]>
+
+
+
+
+
+
+
+
+
+
+ Class<T>) of the
+ argument of type T.
+ @param The type of the argument
+ @param t the object to get it class
+ @return Class<T>]]>
+
+
+
+
+
+
+ List<T> to a an array of
+ T[].
+ @param c the Class object of the items in the list
+ @param list the list to convert]]>
+
+
+
+
+
+ List<T> to a an array of
+ T[].
+ @param list the list to convert
+ @throws ArrayIndexOutOfBoundsException if the list is empty.
+ Use {@link #toArray(Class, List)} if the list may be empty.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ io.file.buffer.size specified in the given
+ Configuration.
+ @param in input stream
+ @param conf configuration
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if native-hadoop is loaded,
+ else false]]>
+
+
+
+
+
+ true if native hadoop libraries, if present, can be
+ used for this job; false otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ { pq.top().change(); pq.adjustTop(); }
+ instead of
+ { o = pq.pop(); o.change(); pq.push(o); }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Clients and/or applications can use the provided Progressable
+ to explicitly report progress to the Hadoop framework. This is especially
+ important for operations which take an insignificant amount of time since,
+ in-lieu of the reported progress, the framework has to assume that an error
+ has occured and time-out the operation.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Class is to be obtained
+ @return the correctly typed Class of the given object.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Hadoop Pipes
+ or Hadoop Streaming.
+
+ It also checks to ensure that we are running on a *nix platform else
+ (e.g. in Cygwin/Windows) it returns null.
+ @param conf configuration
+ @return a String[] with the ulimit command arguments or
+ null if we are running on a non *nix platform or
+ if the limit is unspecified.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Shell interface.
+ @param cmd shell command to execute.
+ @return the output of the executed command.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Shell can be used to run unix commands like du or
+ df. It also offers facilities to gate commands by
+ time-intervals.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ShellCommandExecutorshould be used in cases where the output
+ of the command needs no explicit parsing and where the command, working
+ directory and the environment remains unchanged. The output of the command
+ is stored as-is and is expected to be small.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ArrayList of string values]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ charToEscape in the string
+ with the escape char escapeChar
+
+ @param str string
+ @param escapeChar escape char
+ @param charToEscape the char to be escaped
+ @return an escaped string]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ charToEscape in the string
+ with the escape char escapeChar
+
+ @param str string
+ @param escapeChar escape char
+ @param charToEscape the escaped char
+ @return an unescaped string]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Tool, is the standard for any Map-Reduce tool/application.
+ The tool/application should delegate the handling of
+
+ standard command-line options to {@link ToolRunner#run(Tool, String[])}
+ and only handle its custom arguments.
+
+
Here is how a typical Tool is implemented:
+
+ public class MyApp extends Configured implements Tool {
+
+ public int run(String[] args) throws Exception {
+ // Configuration processed by ToolRunner
+ Configuration conf = getConf();
+
+ // Create a JobConf using the processed conf
+ JobConf job = new JobConf(conf, MyApp.class);
+
+ // Process custom command-line options
+ Path in = new Path(args[1]);
+ Path out = new Path(args[2]);
+
+ // Specify various job-specific parameters
+ job.setJobName("my-app");
+ job.setInputPath(in);
+ job.setOutputPath(out);
+ job.setMapperClass(MyApp.MyMapper.class);
+ job.setReducerClass(MyApp.MyReducer.class);
+
+ // Submit the job, then poll for progress until the job is complete
+ JobClient.runJob(job);
+ }
+
+ public static void main(String[] args) throws Exception {
+ // Let ToolRunner handle generic command-line options
+ int res = ToolRunner.run(new Configuration(), new Sort(), args);
+
+ System.exit(res);
+ }
+ }
+
+
+ @see GenericOptionsParser
+ @see ToolRunner]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Tool by {@link Tool#run(String[])}, after
+ parsing with the given generic arguments. Uses the given
+ Configuration, or builds one if null.
+
+ Sets the Tool's configuration with the possibly modified
+ version of the conf.
+
+ @param conf Configuration for the Tool.
+ @param tool Tool to run.
+ @param args command-line arguments to the tool.
+ @return exit code of the {@link Tool#run(String[])} method.]]>
+
+
+
+
+
+
+
+ Tool with its Configuration.
+
+ Equivalent to run(tool.getConf(), tool, args).
+
+ @param tool Tool to run.
+ @param args command-line arguments to the tool.
+ @return exit code of the {@link Tool#run(String[])} method.]]>
+
+
+
+
+
+
+
+
+
+ ToolRunner can be used to run classes implementing
+ Tool interface. It works in conjunction with
+ {@link GenericOptionsParser} to parse the
+
+ generic hadoop command line arguments and modifies the
+ Configuration of the Tool. The
+ application-specific options are passed along without being modified.
+
+
+ @see Tool
+ @see GenericOptionsParser]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/x86_64/share/hadoop/common/jdiff/hadoop_0.20.0.xml b/x86_64/share/hadoop/common/jdiff/hadoop_0.20.0.xml
new file mode 100644
index 0000000..ce6f91b
--- /dev/null
+++ b/x86_64/share/hadoop/common/jdiff/hadoop_0.20.0.xml
@@ -0,0 +1,52140 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ final.
+
+ @param name resource to be added, the classpath is examined for a file
+ with that name.]]>
+
+
+
+
+
+ final.
+
+ @param url url of the resource to be added, the local filesystem is
+ examined directly to find the resource, without referring to
+ the classpath.]]>
+
+
+
+
+
+ final.
+
+ @param file file-path of resource to be added, the local filesystem is
+ examined directly to find the resource, without referring to
+ the classpath.]]>
+
+
+
+
+
+ final.
+
+ @param in InputStream to deserialize the object from.]]>
+
+
+
+
+
+
+
+
+
+
+ name property, null if
+ no such property exists.
+
+ Values are processed for variable expansion
+ before being returned.
+
+ @param name the property name.
+ @return the value of the name property,
+ or null if no such property exists.]]>
+
+
+
+
+
+ name property, without doing
+ variable expansion.
+
+ @param name the property name.
+ @return the value of the name property,
+ or null if no such property exists.]]>
+
+
+
+
+
+
+ value of the name property.
+
+ @param name property name.
+ @param value property value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ name property. If no such property
+ exists, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value, or defaultValue if the property
+ doesn't exist.]]>
+
+
+
+
+
+
+ name property as an int.
+
+ If no such property exists, or if the specified value is not a valid
+ int, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as an int,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to an int.
+
+ @param name property name.
+ @param value int value of the property.]]>
+
+
+
+
+
+
+ name property as a long.
+ If no such property is specified, or if the specified value is not a valid
+ long, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as a long,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to a long.
+
+ @param name property name.
+ @param value long value of the property.]]>
+
+
+
+
+
+
+ name property as a float.
+ If no such property is specified, or if the specified value is not a valid
+ float, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as a float,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to a float.
+
+ @param name property name.
+ @param value property value.]]>
+
+
+
+
+
+
+ name property as a boolean.
+ If no such property is specified, or if the specified value is not a valid
+ boolean, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as a boolean,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to a boolean.
+
+ @param name property name.
+ @param value boolean value of the property.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ name property as
+ a collection of Strings.
+ If no such property is specified then empty collection is returned.
+
+ This is an optimized version of {@link #getStrings(String)}
+
+ @param name property name.
+ @return property value as a collection of Strings.]]>
+
+
+
+
+
+ name property as
+ an array of Strings.
+ If no such property is specified then null is returned.
+
+ @param name property name.
+ @return property value as an array of Strings,
+ or null.]]>
+
+
+
+
+
+
+ name property as
+ an array of Strings.
+ If no such property is specified then default value is returned.
+
+ @param name property name.
+ @param defaultValue The default value
+ @return property value as an array of Strings,
+ or default value.]]>
+
+
+
+
+
+
+ name property as
+ as comma delimited values.
+
+ @param name property name.
+ @param values The values]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ name property
+ as an array of Class.
+ The value of the property specifies a list of comma separated class names.
+ If no such property is specified, then defaultValue is
+ returned.
+
+ @param name the property name.
+ @param defaultValue default value.
+ @return property value as a Class[],
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property as a Class.
+ If no such property is specified, then defaultValue is
+ returned.
+
+ @param name the class name.
+ @param defaultValue default value.
+ @return property value as a Class,
+ or defaultValue.]]>
+
+
+
+
+
+
+
+ name property as a Class
+ implementing the interface specified by xface.
+
+ If no such property is specified, then defaultValue is
+ returned.
+
+ An exception is thrown if the returned class does not implement the named
+ interface.
+
+ @param name the class name.
+ @param defaultValue default value.
+ @param xface the interface implemented by the named class.
+ @return property value as a Class,
+ or defaultValue.]]>
+
+
+
+
+
+
+
+ name property to the name of a
+ theClass implementing the given interface xface.
+
+ An exception is thrown if theClass does not implement the
+ interface xface.
+
+ @param name property name.
+ @param theClass property value.
+ @param xface the interface implemented by the named class.]]>
+
+
+
+
+
+
+
+ dirsProp with
+ the given path. If dirsProp contains multiple directories,
+ then one is chosen based on path's hash code. If the selected
+ directory does not exist, an attempt is made to create it.
+
+ @param dirsProp directory in which to locate the file.
+ @param path file-path.
+ @return local file under the directory with the given path.]]>
+
+
+
+
+
+
+
+ dirsProp with
+ the given path. If dirsProp contains multiple directories,
+ then one is chosen based on path's hash code. If the selected
+ directory does not exist, an attempt is made to create it.
+
+ @param dirsProp directory in which to locate the file.
+ @param path file-path.
+ @return local file under the directory with the given path.]]>
+
+
+
+
+
+
+
+
+
+
+
+ name.
+
+ @param name configuration resource name.
+ @return an input stream attached to the resource.]]>
+
+
+
+
+
+ name.
+
+ @param name configuration resource name.
+ @return a reader attached to the resource.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ String
+ key-value pairs in the configuration.
+
+ @return an iterator over the entries.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true to set quiet-mode on, false
+ to turn it off.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Resources
+
+
Configurations are specified by resources. A resource contains a set of
+ name/value pairs as XML data. Each resource is named by either a
+ String or by a {@link Path}. If named by a String,
+ then the classpath is examined for a file with that name. If named by a
+ Path, then the local filesystem is examined directly, without
+ referring to the classpath.
+
+
Unless explicitly turned off, Hadoop by default specifies two
+ resources, loaded in-order from the classpath:
core-site.xml: Site-specific configuration for a given hadoop
+ installation.
+
+ Applications may add additional resources, which are loaded
+ subsequent to these resources in the order they are added.
+
+
Final Parameters
+
+
Configuration parameters may be declared final.
+ Once a resource declares a value final, no subsequently-loaded
+ resource can alter that value.
+ For example, one might define a final parameter with:
+
+
+ When conf.get("tempdir") is called, then ${basedir}
+ will be resolved to another property in this Configuration, while
+ ${user.name} would then ordinarily be resolved to the value
+ of the System property with that name.]]>
+
Applications specify the files, via urls (hdfs:// or http://) to be cached
+ via the {@link org.apache.hadoop.mapred.JobConf}.
+ The DistributedCache assumes that the
+ files specified via hdfs:// urls are already present on the
+ {@link FileSystem} at the path specified by the url.
+
+
The framework will copy the necessary files on to the slave node before
+ any tasks for the job are executed on that node. Its efficiency stems from
+ the fact that the files are only copied once per job and the ability to
+ cache archives which are un-archived on the slaves.
+
+
DistributedCache can be used to distribute simple, read-only
+ data/text files and/or more complex types such as archives, jars etc.
+ Archives (zip, tar and tgz/tar.gz files) are un-archived at the slave nodes.
+ Jars may be optionally added to the classpath of the tasks, a rudimentary
+ software distribution mechanism. Files have execution permissions.
+ Optionally users can also direct it to symlink the distributed cache file(s)
+ into the working directory of the task.
+
+
DistributedCache tracks modification timestamps of the cache
+ files. Clearly the cache files should not be modified by the application
+ or externally while the job is executing.
+
+
Here is an illustrative example on how to use the
+ DistributedCache:
+
+ // Setting up the cache for the application
+
+ 1. Copy the requisite files to the FileSystem:
+
+ $ bin/hadoop fs -copyFromLocal lookup.dat /myapp/lookup.dat
+ $ bin/hadoop fs -copyFromLocal map.zip /myapp/map.zip
+ $ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar
+ $ bin/hadoop fs -copyFromLocal mytar.tar /myapp/mytar.tar
+ $ bin/hadoop fs -copyFromLocal mytgz.tgz /myapp/mytgz.tgz
+ $ bin/hadoop fs -copyFromLocal mytargz.tar.gz /myapp/mytargz.tar.gz
+
+ 2. Setup the application's JobConf:
+
+ JobConf job = new JobConf();
+ DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"),
+ job);
+ DistributedCache.addCacheArchive(new URI("/myapp/map.zip", job);
+ DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
+ DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar", job);
+ DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz", job);
+ DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz", job);
+
+ 3. Use the cached files in the {@link org.apache.hadoop.mapred.Mapper}
+ or {@link org.apache.hadoop.mapred.Reducer}:
+
+ public static class MapClass extends MapReduceBase
+ implements Mapper<K, V, K, V> {
+
+ private Path[] localArchives;
+ private Path[] localFiles;
+
+ public void configure(JobConf job) {
+ // Get the cached archives/files
+ localArchives = DistributedCache.getLocalCacheArchives(job);
+ localFiles = DistributedCache.getLocalCacheFiles(job);
+ }
+
+ public void map(K key, V value,
+ OutputCollector<K, V> output, Reporter reporter)
+ throws IOException {
+ // Use data from the cached archives/files here
+ // ...
+ // ...
+ output.collect(k, v);
+ }
+ }
+
+
+ A filename pattern is composed of regular characters and
+ special pattern matching characters, which are:
+
+
+
+
+
+
?
+
Matches any single character.
+
+
+
*
+
Matches zero or more characters.
+
+
+
[abc]
+
Matches a single character from character set
+ {a,b,c}.
+
+
+
[a-b]
+
Matches a single character from the character range
+ {a...b}. Note that character a must be
+ lexicographically less than or equal to character b.
+
+
+
[^a]
+
Matches a single character that is not from character set or range
+ {a}. Note that the ^ character must occur
+ immediately to the right of the opening bracket.
+
+
+
\c
+
Removes (escapes) any special meaning of character c.
+
+
+
{ab,cd}
+
Matches a string from the string set {ab, cd}
+
+
+
{ab,c{de,fh}}
+
Matches a string from the string set {ab, cde, cfh}
+
+
+
+
+
+ @param pathPattern a regular expression specifying a pth pattern
+
+ @return an array of paths that match the path pattern
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ All user code that may potentially use the Hadoop Distributed
+ File System should be written to use a FileSystem object. The
+ Hadoop DFS is a multi-machine system that appears as a single
+ disk. It's useful because of its fault tolerance and potentially
+ very large capacity.
+
+
+ The local implementation is {@link LocalFileSystem} and distributed
+ implementation is DistributedFileSystem.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FilterFileSystem contains
+ some other file system, which it uses as
+ its basic file system, possibly transforming
+ the data along the way or providing additional
+ functionality. The class FilterFileSystem
+ itself simply overrides all methods of
+ FileSystem with versions that
+ pass all requests to the contained file
+ system. Subclasses of FilterFileSystem
+ may further override some of these methods
+ and may also provide additional methods
+ and fields.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ buf at offset
+ and checksum into checksum.
+ The method is used for implementing read, therefore, it should be optimized
+ for sequential reading
+ @param pos chunkPos
+ @param buf desitination buffer
+ @param offset offset in buf at which to store data
+ @param len maximun number of bytes to read
+ @return number of bytes read]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ -1 if the end of the
+ stream is reached.
+ @exception IOException if an I/O error occurs.]]>
+
+
+
+
+
+
+
+
+ This method implements the general contract of the corresponding
+ {@link InputStream#read(byte[], int, int) read} method of
+ the {@link InputStream} class. As an additional
+ convenience, it attempts to read as many bytes as possible by repeatedly
+ invoking the read method of the underlying stream. This
+ iterated read continues until one of the following
+ conditions becomes true:
+
+
The specified number of bytes have been read,
+
+
The read method of the underlying stream returns
+ -1, indicating end-of-file.
+
+
If the first read on the underlying stream returns
+ -1 to indicate end-of-file then this method returns
+ -1. Otherwise this method returns the number of bytes
+ actually read.
+
+ @param b destination buffer.
+ @param off offset at which to start storing bytes.
+ @param len maximum number of bytes to read.
+ @return the number of bytes read, or -1 if the end of
+ the stream has been reached.
+ @exception IOException if an I/O error occurs.
+ ChecksumException if any checksum error occurs]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ n bytes of data from the
+ input stream.
+
+
This method may skip more bytes than are remaining in the backing
+ file. This produces no exception and the number of bytes skipped
+ may include some number of bytes that were beyond the EOF of the
+ backing file. Attempting to read from the stream after skipping past
+ the end will result in -1 indicating the end of the file.
+
+
If n is negative, no bytes are skipped.
+
+ @param n the number of bytes to be skipped.
+ @return the actual number of bytes skipped.
+ @exception IOException if an I/O error occurs.
+ ChecksumException if the chunk to skip to is corrupted]]>
+
+
+
+
+
+
+ This method may seek past the end of the file.
+ This produces no exception and an attempt to read from
+ the stream will result in -1 indicating the end of the file.
+
+ @param pos the postion to seek to.
+ @exception IOException if an I/O error occurs.
+ ChecksumException if the chunk to seek to is corrupted]]>
+
+
+
+
+
+
+
+
+
+ len bytes from
+ stm
+
+ @param stm an input stream
+ @param buf destiniation buffer
+ @param offset offset at which to store data
+ @param len number of bytes to read
+ @return actual number of bytes read
+ @throws IOException if there is any IO error]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ len bytes from the specified byte array
+ starting at offset off and generate a checksum for
+ each data chunk.
+
+
This method stores bytes from the given array into this
+ stream's buffer before it gets checksumed. The buffer gets checksumed
+ and flushed to the underlying output stream when all data
+ in a checksum chunk are in the buffer. If the buffer is empty and
+ requested length is at least as large as the size of next checksum chunk
+ size, this method will checksum and write the chunk directly
+ to the underlying output stream. Thus it avoids uneccessary data copy.
+
+ @param b the data.
+ @param off the start offset in the data.
+ @param len the number of bytes to write.
+ @exception IOException if an I/O error occurs.]]>
+
+
+ DataInputBuffer buffer = new DataInputBuffer();
+ while (... loop condition ...) {
+ byte[] data = ... get data ...;
+ int dataLength = ... get data length ...;
+ buffer.reset(data, dataLength);
+ ... read buffer using DataInput methods ...
+ }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This saves memory over creating a new DataOutputStream and
+ ByteArrayOutputStream each time data is written.
+
+
Typical usage is something like the following:
+
+ DataOutputBuffer buffer = new DataOutputBuffer();
+ while (... loop condition ...) {
+ buffer.reset();
+ ... write buffer using DataOutput methods ...
+ byte[] data = buffer.getData();
+ int dataLength = buffer.getLength();
+ ... write data to its ultimate destination ...
+ }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to store
+ @param item the object to be stored
+ @param keyName the name of the key to use
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param keyName the name of the key to use
+ @param itemClass the class of the item
+ @return restored object
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param items the objects to be stored
+ @param keyName the name of the key to use
+ @throws IndexOutOfBoundsException if the items array is empty
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param keyName the name of the key to use
+ @param itemClass the class of the item
+ @return restored object
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+ DefaultStringifier offers convenience methods to store/load objects to/from
+ the configuration.
+
+ @param the class of the objects to stringify]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a DoubleWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a FloatWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ When two sequence files, which have same Key type but different Value
+ types, are mapped out to reduce, multiple Value types is not allowed.
+ In this case, this class can help you wrap instances with different types.
+
+
+
+ Compared with ObjectWritable, this class is much more effective,
+ because ObjectWritable will append the class declaration as a String
+ into the output file in every Key-Value pair.
+
+
+
+ Generic Writable implements {@link Configurable} interface, so that it will be
+ configured by the framework. The configuration is passed to the wrapped objects
+ implementing {@link Configurable} interface before deserialization.
+
+
+ how to use it:
+ 1. Write your own class, such as GenericObject, which extends GenericWritable.
+ 2. Implements the abstract method getTypes(), defines
+ the classes which will be wrapped in GenericObject in application.
+ Attention: this classes defined in getTypes() method, must
+ implement Writable interface.
+
+
+ @since Nov 8, 2006]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This saves memory over creating a new InputStream and
+ ByteArrayInputStream each time data is read.
+
+
Typical usage is something like the following:
+
+ InputBuffer buffer = new InputBuffer();
+ while (... loop condition ...) {
+ byte[] data = ... get data ...;
+ int dataLength = ... get data length ...;
+ buffer.reset(data, dataLength);
+ ... read buffer using InputStream methods ...
+ }
+
+ @see DataInputBuffer
+ @see DataOutput]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a IntWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ closes the input and output streams
+ at the end.
+ @param in InputStrem to read from
+ @param out OutputStream to write to
+ @param conf the Configuration object]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ignore any {@link IOException} or
+ null pointers. Must only be used for cleanup in exception handlers.
+ @param log the log to record problems to at debug level. Can be null.
+ @param closeables the objects to close]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a LongWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A map is a directory containing two files, the data file,
+ containing all keys and values in the map, and a smaller index
+ file, containing a fraction of the keys. The fraction is determined by
+ {@link Writer#getIndexInterval()}.
+
+
The index file is read entirely into memory. Thus key implementations
+ should try to keep themselves small.
+
+
Map files are created by adding entries in-order. To maintain a large
+ database, perform updates by copying the previous version of a database and
+ merging in a sorted change list, to create a new version of the database in
+ a new file. Sorting large change lists can be done with {@link
+ SequenceFile.Sorter}.]]>
+
SequenceFile provides {@link Writer}, {@link Reader} and
+ {@link Sorter} classes for writing, reading and sorting respectively.
+
+ There are three SequenceFileWriters based on the
+ {@link CompressionType} used to compress key/value pairs:
+
+
+ Writer : Uncompressed records.
+
+
+ RecordCompressWriter : Record-compressed files, only compress
+ values.
+
+
+ BlockCompressWriter : Block-compressed files, both keys &
+ values are collected in 'blocks'
+ separately and compressed. The size of
+ the 'block' is configurable.
+
+
+
The actual compression algorithm used to compress key and/or values can be
+ specified by using the appropriate {@link CompressionCodec}.
+
+
The recommended way is to use the static createWriter methods
+ provided by the SequenceFile to chose the preferred format.
+
+
The {@link Reader} acts as the bridge and can read any of the above
+ SequenceFile formats.
+
+
SequenceFile Formats
+
+
Essentially there are 3 different formats for SequenceFiles
+ depending on the CompressionType specified. All of them share a
+ common header described below.
+
+
SequenceFile Header
+
+
+ version - 3 bytes of magic header SEQ, followed by 1 byte of actual
+ version number (e.g. SEQ4 or SEQ6)
+
+
+ keyClassName -key class
+
+
+ valueClassName - value class
+
+
+ compression - A boolean which specifies if compression is turned on for
+ keys/values in this file.
+
+
+ blockCompression - A boolean which specifies if block-compression is
+ turned on for keys/values in this file.
+
+
+ compression codec - CompressionCodec class which is used for
+ compression of keys and/or values (if compression is
+ enabled).
+
+
+ metadata - {@link Metadata} for this file.
+
+
+ sync - A sync marker to denote end of the header.
+
The compressed blocks of key lengths and value lengths consist of the
+ actual lengths of individual keys/values encoded in ZeroCompressedInteger
+ format.
+
+ @see CompressionCodec]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ key, skipping its
+ value. True if another entry exists, and false at end of file.]]>
+
+
+
+
+
+
+
+ key and
+ val. Returns true if such a pair exists and false when at
+ end of file]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The position passed must be a position returned by {@link
+ SequenceFile.Writer#getLength()} when writing this file. To seek to an arbitrary
+ position, use {@link SequenceFile.Reader#sync(long)}.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ SegmentDescriptor
+ @param segments the list of SegmentDescriptors
+ @param tmpDir the directory to write temporary files into
+ @return RawKeyValueIterator
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ For best performance, applications should make sure that the {@link
+ Writable#readFields(DataInput)} implementation of their keys is
+ very efficient. In particular, it should avoid allocating memory.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This always returns a synchronized position. In other words,
+ immediately after calling {@link SequenceFile.Reader#seek(long)} with a position
+ returned by this method, {@link SequenceFile.Reader#next(Writable)} may be called. However
+ the key may be earlier in the file than key last written when this
+ method was called (e.g., with block-compression, it may be the first key
+ in the block that was being written when this method was called).]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ key. Returns
+ true if such a key exists and false when at the end of the set.]]>
+
+
+
+
+
+
+ key.
+ Returns key, or null if no match exists.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the class of the objects to stringify]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ position. Note that this
+ method avoids using the converter or doing String instatiation
+ @return the Unicode scalar value at position or -1
+ if the position is invalid or points to a
+ trailing byte]]>
+
+
+
+
+
+
+
+
+
+ what in the backing
+ buffer, starting as position start. The starting
+ position is measured in bytes and the return value is in
+ terms of byte position in the buffer. The backing buffer is
+ not converted to a string for this operation.
+ @return byte position of the first occurence of the search
+ string in the UTF-8 buffer or -1 if not found]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a Text with the same contents.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ replace is true, then
+ malformed input is replaced with the
+ substitution character, which is U+FFFD. Otherwise the
+ method throws a MalformedInputException.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ replace is true, then
+ malformed input is replaced with the
+ substitution character, which is U+FFFD. Otherwise the
+ method throws a MalformedInputException.
+ @return ByteBuffer: bytes stores at ByteBuffer.array()
+ and length is ByteBuffer.limit()]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ In
+ addition, it provides methods for string traversal without converting the
+ byte array to a string.
Also includes utilities for
+ serializing/deserialing a string, coding/decoding a string, checking if a
+ byte array contains valid UTF8 code, calculating the length of an encoded
+ string.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a UTF8 with the same contents.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Also includes utilities for efficiently reading and writing UTF-8.
+
+ @deprecated replaced by Text]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This is useful when a class may evolve, so that instances written by the
+ old version of the class may still be processed by the new version. To
+ handle this situation, {@link #readFields(DataInput)}
+ implementations should catch {@link VersionMismatchException}.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a VIntWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a VLongWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ out.
+
+ @param out DataOuput to serialize this object into.
+ @throws IOException]]>
+
+
+
+
+
+
+ in.
+
+
For efficiency, implementations should attempt to re-use storage in the
+ existing object where possible.
+
+ @param in DataInput to deseriablize this object from.
+ @throws IOException]]>
+
+
+
+ Any key or value type in the Hadoop Map-Reduce
+ framework implements this interface.
+
+
Implementations typically implement a static read(DataInput)
+ method which constructs a new instance, calls {@link #readFields(DataInput)}
+ and returns the instance.
+
+
Example:
+
+ public class MyWritable implements Writable {
+ // Some data
+ private int counter;
+ private long timestamp;
+
+ public void write(DataOutput out) throws IOException {
+ out.writeInt(counter);
+ out.writeLong(timestamp);
+ }
+
+ public void readFields(DataInput in) throws IOException {
+ counter = in.readInt();
+ timestamp = in.readLong();
+ }
+
+ public static MyWritable read(DataInput in) throws IOException {
+ MyWritable w = new MyWritable();
+ w.readFields(in);
+ return w;
+ }
+ }
+
]]>
+
+
+
+
+
+
+
+
+ WritableComparables can be compared to each other, typically
+ via Comparators. Any type which is to be used as a
+ key in the Hadoop Map-Reduce framework should implement this
+ interface.
+
+
Example:
+
+ public class MyWritableComparable implements WritableComparable {
+ // Some data
+ private int counter;
+ private long timestamp;
+
+ public void write(DataOutput out) throws IOException {
+ out.writeInt(counter);
+ out.writeLong(timestamp);
+ }
+
+ public void readFields(DataInput in) throws IOException {
+ counter = in.readInt();
+ timestamp = in.readLong();
+ }
+
+ public int compareTo(MyWritableComparable w) {
+ int thisValue = this.value;
+ int thatValue = ((IntWritable)o).value;
+ return (thisValue < thatValue ? -1 : (thisValue==thatValue ? 0 : 1));
+ }
+ }
+
One may optimize compare-intensive operations by overriding
+ {@link #compare(byte[],int,int,byte[],int,int)}. Static utility methods are
+ provided to assist in optimized implementations of this method.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Enum type
+ @param in DataInput to read from
+ @param enumType Class type of Enum
+ @return Enum represented by String read from DataInput
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ len number of bytes in input streamin
+ @param in input stream
+ @param len number of bytes to skip
+ @throws IOException when skipped less number of bytes]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ CompressionCodec for which to get the
+ Compressor
+ @return Compressor for the given
+ CompressionCodec from the pool or a new one]]>
+
+
+
+
+
+ CompressionCodec for which to get the
+ Decompressor
+ @return Decompressor for the given
+ CompressionCodec the pool or a new one]]>
+
+
+
+
+
+ Compressor to be returned to the pool]]>
+
+
+
+
+
+ Decompressor to be returned to the
+ pool]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Implementations are assumed to be buffered. This permits clients to
+ reposition the underlying input stream then call {@link #resetState()},
+ without having to also synchronize client buffers.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true indicating that more input data is required.
+
+ @param b Input data
+ @param off Start offset
+ @param len Length]]>
+
+
+
+
+ true if the input data buffer is empty and
+ #setInput() should be called in order to provide more input.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the end of the compressed
+ data output stream has been reached.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true indicating that more input data is required.
+
+ @param b Input data
+ @param off Start offset
+ @param len Length]]>
+
+
+
+
+ true if the input data buffer is empty and
+ #setInput() should be called in order to provide more input.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ true if a preset dictionary is needed for decompression.
+ @return true if a preset dictionary is needed for decompression]]>
+
+
+
+
+ true if the end of the compressed
+ data output stream has been reached.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FIXME: This array should be in a private or package private location,
+ since it could be modified by malicious code.
+ ]]>
+
+
+
+
+ This interface is public for historical purposes. You should have no need to
+ use it.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Although BZip2 headers are marked with the magic "Bz" this
+ constructor expects the next byte in the stream to be the first one after
+ the magic. Thus callers have to skip the first two bytes. Otherwise this
+ constructor will throw an exception.
+
+
+ @throws IOException
+ if the stream content is malformed or an I/O error occurs.
+ @throws NullPointerException
+ if in == null]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The decompression requires large amounts of memory. Thus you should call the
+ {@link #close() close()} method as soon as possible, to force
+ CBZip2InputStream to release the allocated memory. See
+ {@link CBZip2OutputStream CBZip2OutputStream} for information about memory
+ usage.
+
+
+
+ CBZip2InputStream reads bytes from the compressed source stream via
+ the single byte {@link java.io.InputStream#read() read()} method exclusively.
+ Thus you should consider to use a buffered source stream.
+
+
+
+ Instances of this class are not threadsafe.
+
]]>
+
+
+
+
+
+
+
+
+
+ CBZip2OutputStream with a blocksize of 900k.
+
+
+ Attention: The caller is resonsible to write the two BZip2 magic
+ bytes "BZ" to the specified stream prior to calling this
+ constructor.
+
+
+ @param out *
+ the destination stream.
+
+ @throws IOException
+ if an I/O error occurs in the specified stream.
+ @throws NullPointerException
+ if out == null.]]>
+
+
+
+
+
+ CBZip2OutputStream with specified blocksize.
+
+
+ Attention: The caller is resonsible to write the two BZip2 magic
+ bytes "BZ" to the specified stream prior to calling this
+ constructor.
+
+
+
+ @param out
+ the destination stream.
+ @param blockSize
+ the blockSize as 100k units.
+
+ @throws IOException
+ if an I/O error occurs in the specified stream.
+ @throws IllegalArgumentException
+ if (blockSize < 1) || (blockSize > 9).
+ @throws NullPointerException
+ if out == null.
+
+ @see #MIN_BLOCKSIZE
+ @see #MAX_BLOCKSIZE]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ inputLength this method returns MAX_BLOCKSIZE
+ always.
+
+ @param inputLength
+ The length of the data which will be compressed by
+ CBZip2OutputStream.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ == 1.]]>
+
+
+
+
+ == 9.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ If you are ever unlucky/improbable enough to get a stack overflow whilst
+ sorting, increase the following constant and try again. In practice I
+ have never seen the stack go above 27 elems, so the following limit seems
+ very generous.
+ ]]>
+
+
+
+
+ The compression requires large amounts of memory. Thus you should call the
+ {@link #close() close()} method as soon as possible, to force
+ CBZip2OutputStream to release the allocated memory.
+
+
+
+ You can shrink the amount of allocated memory and maybe raise the compression
+ speed by choosing a lower blocksize, which in turn may cause a lower
+ compression ratio. You can avoid unnecessary memory allocation by avoiding
+ using a blocksize which is bigger than the size of the input.
+
+
+
+ You can compute the memory usage for compressing by the following formula:
+
+
+
+ <code>400k + (9 * blocksize)</code>.
+
+
+
+ To get the memory required for decompression by {@link CBZip2InputStream
+ CBZip2InputStream} use
+
+
+
+ <code>65k + (5 * blocksize)</code>.
+
+
+
+
+
+
+
Memory usage by blocksize
+
+
+
Blocksize
Compression
+ memory usage
Decompression
+ memory usage
+
+
+
100k
+
1300k
+
565k
+
+
+
200k
+
2200k
+
1065k
+
+
+
300k
+
3100k
+
1565k
+
+
+
400k
+
4000k
+
2065k
+
+
+
500k
+
4900k
+
2565k
+
+
+
600k
+
5800k
+
3065k
+
+
+
700k
+
6700k
+
3565k
+
+
+
800k
+
7600k
+
4065k
+
+
+
900k
+
8500k
+
4565k
+
+
+
+
+ For decompression CBZip2InputStream allocates less memory if the
+ bzipped input is smaller than one block.
+
+
+
+ Instances of this class are not threadsafe.
+
+
+
+ TODO: Update to BZip2 1.0.1
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ @return the total (non-negative) number of uncompressed bytes input so far]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ @return the total (non-negative) number of uncompressed bytes input so far]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if native-zlib is loaded & initialized
+ and can be loaded for this job, else false]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Keep trying a limited number of times, waiting a fixed time between attempts,
+ and then fail by re-throwing the exception.
+ ]]>
+
+
+
+
+
+
+
+
+ Keep trying for a maximum time, waiting a fixed time between attempts,
+ and then fail by re-throwing the exception.
+ ]]>
+
+
+
+
+
+
+
+
+ Keep trying a limited number of times, waiting a growing amount of time between attempts,
+ and then fail by re-throwing the exception.
+ The time between attempts is sleepTime mutliplied by the number of tries so far.
+ ]]>
+
+
+
+
+
+
+
+
+ Keep trying a limited number of times, waiting a growing amount of time between attempts,
+ and then fail by re-throwing the exception.
+ The time between attempts is sleepTime mutliplied by a random
+ number in the range of [0, 2 to the number of retries)
+ ]]>
+
+
+
+
+
+
+
+ Set a default policy with some explicit handlers for specific exceptions.
+ ]]>
+
+
+
+
+
+
+
+ A retry policy for RemoteException
+ Set a default policy with some explicit handlers for specific exceptions.
+ ]]>
+
+
+
+
+
+ Try once, and fail by re-throwing the exception.
+ This corresponds to having no retry mechanism in place.
+ ]]>
+
+
+
+
+
+ Try once, and fail silently for void methods, or by
+ re-throwing the exception for non-void methods.
+ ]]>
+
+
+
+
+
+ Keep trying forever.
+ ]]>
+
+
+
+
+ A collection of useful implementations of {@link RetryPolicy}.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+ Determines whether the framework should retry a
+ method for the given exception, and the number
+ of retries that have been made for that operation
+ so far.
+
+ @param e The exception that caused the method to fail.
+ @param retries The number of times the method has been retried.
+ @return true if the method should be retried,
+ false if the method should not be retried
+ but shouldn't fail with an exception (only for void methods).
+ @throws Exception The re-thrown exception e indicating
+ that the method failed and should not be retried further.]]>
+
+
+
+
+ Specifies a policy for retrying method failures.
+ Implementations of this interface should be immutable.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Create a proxy for an interface of an implementation class
+ using the same retry policy for each method in the interface.
+
+ @param iface the interface that the retry will implement
+ @param implementation the instance whose methods should be retried
+ @param retryPolicy the policy for retirying method call failures
+ @return the retry proxy]]>
+
+
+
+
+
+
+
+
+ Create a proxy for an interface of an implementation class
+ using the a set of retry policies specified by method name.
+ If no retry policy is defined for a method then a default of
+ {@link RetryPolicies#TRY_ONCE_THEN_FAIL} is used.
+
+ @param iface the interface that the retry will implement
+ @param implementation the instance whose methods should be retried
+ @param methodNameToPolicyMap a map of method names to retry policies
+ @return the retry proxy]]>
+
+
+
+
+ A factory for creating retry proxies.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+ Prepare the deserializer for reading.]]>
+
+
+
+
+
+
+
+ Deserialize the next object from the underlying input stream.
+ If the object t is non-null then this deserializer
+ may set its internal state to the next object read from the input
+ stream. Otherwise, if the object t is null a new
+ deserialized object will be created.
+
+ @return the deserialized object]]>
+
+
+
+
+
+ Close the underlying input stream and clear up any resources.]]>
+
+
+
+
+ Provides a facility for deserializing objects of type from an
+ {@link InputStream}.
+
+
+
+ Deserializers are stateful, but must not buffer the input since
+ other producers may read from the input between calls to
+ {@link #deserialize(Object)}.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A {@link RawComparator} that uses a {@link Deserializer} to deserialize
+ the objects to be compared so that the standard {@link Comparator} can
+ be used to compare them.
+
+
+ One may optimize compare-intensive operations by using a custom
+ implementation of {@link RawComparator} that operates directly
+ on byte representations.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ An experimental {@link Serialization} for Java {@link Serializable} classes.
+
+ @see JavaSerializationComparator]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A {@link RawComparator} that uses a {@link JavaSerialization}
+ {@link Deserializer} to deserialize objects that are then compared via
+ their {@link Comparable} interfaces.
+
+ @param
+ @see JavaSerialization]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Encapsulates a {@link Serializer}/{@link Deserializer} pair.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+ Serializations are found by reading the io.serializations
+ property from conf, which is a comma-delimited list of
+ classnames.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A factory for {@link Serialization}s.
+ ]]>
+
+
+
+
+
+
+
+
+
+ Prepare the serializer for writing.]]>
+
+
+
+
+
+
+ Serialize t to the underlying output stream.]]>
+
+
+
+
+
+ Close the underlying output stream and clear up any resources.]]>
+
+
+
+
+ Provides a facility for serializing objects of type to an
+ {@link OutputStream}.
+
+
+
+ Serializers are stateful, but must not buffer the output since
+ other producers may write to the output between calls to
+ {@link #serialize(Object)}.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ param, to the IPC server running at
+ address, returning the value. Throws exceptions if there are
+ network problems or if the remote code threw an exception.
+ @deprecated Use {@link #call(Writable, InetSocketAddress, Class, UserGroupInformation)} instead]]>
+
+
+
+
+
+
+
+
+
+ param, to the IPC server running at
+ address with the ticket credentials, returning
+ the value.
+ Throws exceptions if there are network problems or if the remote code
+ threw an exception.
+ @deprecated Use {@link #call(Writable, InetSocketAddress, Class, UserGroupInformation)} instead]]>
+
+
+
+
+
+
+
+
+
+
+ param, to the IPC server running at
+ address which is servicing the protocol protocol,
+ with the ticket credentials, returning the value.
+ Throws exceptions if there are network problems or if the remote code
+ threw an exception.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Unwraps any IOException.
+
+ @param lookupTypes the desired exception class.
+ @return IOException, which is either the lookupClass exception or this.]]>
+
+
+
+
+ This unwraps any Throwable that has a constructor taking
+ a String as a parameter.
+ Otherwise it returns this.
+
+ @return Throwable]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ protocol is a Java interface. All parameters and return types must
+ be one of:
+
+
a primitive type, boolean, byte,
+ char, short, int, long,
+ float, double, or void; or
+
+
a {@link String}; or
+
+
a {@link Writable}; or
+
+
an array of the above types
+
+ All methods in the protocol should throw only IOException. No field data of
+ the protocol instance is transmitted.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ handlerCount determines
+ the number of handler threads that will be used to process calls.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ,name=RpcActivityForPort"
+
+ Many of the activity metrics are sampled and averaged on an interval
+ which can be specified in the metrics config file.
+
+ For the metrics that are sampled and averaged, one must specify
+ a metrics context that does periodic update calls. Most metrics contexts do.
+ The default Null metrics context however does NOT. So if you aren't
+ using any other metrics context then you can turn on the viewing and averaging
+ of sampled metrics by specifying the following two lines
+ in the hadoop-meterics.properties file:
+
+ Note that the metrics are collected regardless of the context used.
+ The context with the update thread is used to average the data periodically
+
+
+
+ Impl details: We use a dynamic mbean that gets the list of the metrics
+ from the metrics registry passed as an argument to the constructor]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class has a number of metrics variables that are publicly accessible;
+ these variables (objects) have methods to update their values;
+ for example:
+
{@link #rpcQueueTime}.inc(time)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ For the statistics that are sampled and averaged, one must specify
+ a metrics context that does periodic update calls. Most do.
+ The default Null metrics context however does NOT. So if you aren't
+ using any other metrics context then you can turn on the viewing and averaging
+ of sampled metrics by specifying the following two lines
+ in the hadoop-meterics.properties file:
+
+ Note that the metrics are collected regardless of the context used.
+ The context with the update thread is used to average the data periodically]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ When constructing the instance, if the factory property
+ contextName.class exists,
+ its value is taken to be the name of the class to instantiate. Otherwise,
+ the default is to create an instance of
+ org.apache.hadoop.metrics.spi.NullContext, which is a
+ dummy "no-op" context which will cause all metric data to be discarded.
+
+ @param contextName the name of the context
+ @return the named MetricsContext]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ When the instance is constructed, this method checks if the file
+ hadoop-metrics.properties exists on the class path. If it
+ exists, it must be in the format defined by java.util.Properties, and all
+ the properties in the file are set as attributes on the newly created
+ ContextFactory instance.
+
+ @return the singleton ContextFactory instance]]>
+
+
+
+ getFactory() method.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ startMonitoring() again after calling
+ this.
+ @see #close()]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ recordName.
+ Throws an exception if the metrics implementation is configured with a fixed
+ set of record names and recordName is not in that set.
+
+ @param recordName the name of the record
+ @throws MetricsException if recordName conflicts with configuration data]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A record name identifies the kind of data to be reported. For example, a
+ program reporting statistics relating to the disks on a computer might use
+ a record name "diskStats".
+
+ A record has zero or more tags. A tag has a name and a value. To
+ continue the example, the "diskStats" record might use a tag named
+ "diskName" to identify a particular disk. Sometimes it is useful to have
+ more than one tag, so there might also be a "diskType" with value "ide" or
+ "scsi" or whatever.
+
+ A record also has zero or more metrics. These are the named
+ values that are to be reported to the metrics system. In the "diskStats"
+ example, possible metric names would be "diskPercentFull", "diskPercentBusy",
+ "kbReadPerSecond", etc.
+
+ The general procedure for using a MetricsRecord is to fill in its tag and
+ metric values, and then call update() to pass the record to the
+ client library.
+ Metric data is not immediately sent to the metrics system
+ each time that update() is called.
+ An internal table is maintained, identified by the record name. This
+ table has columns
+ corresponding to the tag and the metric names, and rows
+ corresponding to each unique set of tag values. An update
+ either modifies an existing row in the table, or adds a new row with a set of
+ tag values that are different from all the other rows. Note that if there
+ are no tags, then there can be at most one row in the table.
+
+ Once a row is added to the table, its data will be sent to the metrics system
+ on every timer period, whether or not it has been updated since the previous
+ timer period. If this is inappropriate, for example if metrics were being
+ reported by some transient object in an application, the remove()
+ method can be used to remove the row and thus stop the data from being
+ sent.
+
+ Note that the update() method is atomic. This means that it is
+ safe for different threads to be updating the same metric. More precisely,
+ it is OK for different threads to call update() on MetricsRecord instances
+ with the same set of tag names and tag values. Different threads should
+ not use the same MetricsRecord instance at the same time.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ MetricsContext.registerUpdater().]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ fileName attribute,
+ if specified. Otherwise the data will be written to standard
+ output.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class is configured by setting ContextFactory attributes which in turn
+ are usually configured through a properties file. All the attributes are
+ prefixed by the contextName. For example, the properties file might contain:
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ contextName.tableName. The returned map consists of
+ those attributes with the contextName and tableName stripped off.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ recordName.
+ Throws an exception if the metrics implementation is configured with a fixed
+ set of record names and recordName is not in that set.
+
+ @param recordName the name of the record
+ @throws MetricsException if recordName conflicts with configuration data]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class implements the internal table of metric data, and the timer
+ on which data is to be sent to the metrics system. Subclasses must
+ override the abstract emitRecord method in order to transmit
+ the data. ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ update
+ and remove().]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ hostname or hostname:port. If
+ the specs string is null, defaults to localhost:defaultPort.
+
+ @return a list of InetSocketAddress objects.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ,name="
+ Where the and are the supplied parameters
+
+ @param serviceName
+ @param nameName
+ @param theMbean - the MBean to register
+ @return the named used to register the MBean]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ hadoop.rpc.socket.factory.class.<ClassName>. When no
+ such parameter exists then fall back on the default socket factory as
+ configured by hadoop.rpc.socket.factory.class.default. If
+ this default socket factory is not configured, then fall back on the JVM
+ default socket factory.
+
+ @param conf the configuration
+ @param clazz the class (usually a {@link VersionedProtocol})
+ @return a socket factory]]>
+
+
+
+
+
+ hadoop.rpc.socket.factory.default
+
+ @param conf the configuration
+ @return the default socket factory as specified in the configuration or
+ the JVM default socket factory if the configuration does not
+ contain a default socket factory property.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ :
+ ://:/]]>
+
+
+
+
+
+
+
+ :
+ ://:/]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ From documentation for {@link #getInputStream(Socket, long)}:
+ Returns InputStream for the socket. If the socket has an associated
+ SocketChannel then it returns a
+ {@link SocketInputStream} with the given timeout. If the socket does not
+ have a channel, {@link Socket#getInputStream()} is returned. In the later
+ case, the timeout argument is ignored and the timeout set with
+ {@link Socket#setSoTimeout(int)} applies for reads.
+
+ Any socket created using socket factories returned by {@link #NetUtils},
+ must use this interface instead of {@link Socket#getInputStream()}.
+
+ @see #getInputStream(Socket, long)
+
+ @param socket
+ @return InputStream for reading from the socket.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+ Any socket created using socket factories returned by {@link #NetUtils},
+ must use this interface instead of {@link Socket#getInputStream()}.
+
+ @see Socket#getChannel()
+
+ @param socket
+ @param timeout timeout in milliseconds. This may not always apply. zero
+ for waiting as long as necessary.
+ @return InputStream for reading from the socket.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+ From documentation for {@link #getOutputStream(Socket, long)} :
+ Returns OutputStream for the socket. If the socket has an associated
+ SocketChannel then it returns a
+ {@link SocketOutputStream} with the given timeout. If the socket does not
+ have a channel, {@link Socket#getOutputStream()} is returned. In the later
+ case, the timeout argument is ignored and the write will wait until
+ data is available.
+
+ Any socket created using socket factories returned by {@link #NetUtils},
+ must use this interface instead of {@link Socket#getOutputStream()}.
+
+ @see #getOutputStream(Socket, long)
+
+ @param socket
+ @return OutputStream for writing to the socket.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+ Any socket created using socket factories returned by {@link #NetUtils},
+ must use this interface instead of {@link Socket#getOutputStream()}.
+
+ @see Socket#getChannel()
+
+ @param socket
+ @param timeout timeout in milliseconds. This may not always apply. zero
+ for waiting as long as necessary.
+ @return OutputStream for writing to the socket.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+ socket.connect(endpoint, timeout). If
+ socket.getChannel() returns a non-null channel,
+ connect is implemented using Hadoop's selectors. This is done mainly
+ to avoid Sun's connect implementation from creating thread-local
+ selectors, since Hadoop does not have control on when these are closed
+ and could end up taking all the available file descriptors.
+
+ @see java.net.Socket#connect(java.net.SocketAddress, int)
+
+ @param socket
+ @param endpoint
+ @param timeout - timeout in milliseconds]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ node
+
+ @param node
+ a node
+ @return true if node is already in the tree; false otherwise]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ scope
+ if scope starts with ~, choose one from the all nodes except for the
+ ones in scope; otherwise, choose one from scope
+ @param scope range of nodes from which a node will be choosen
+ @return the choosen node]]>
+
+
+
+
+
+
+ scope but not in excludedNodes
+ if scope starts with ~, return the number of nodes that are not
+ in scope and excludedNodes;
+ @param scope a path string that may start with ~
+ @param excludedNodes a list of nodes
+ @return number of available nodes]]>
+
+
+
+
+
+
+
+
+
+
+
+ reader
+ It linearly scans the array, if a local node is found, swap it with
+ the first element of the array.
+ If a local rack node is found, swap it with the first element following
+ the local node.
+ If neither local node or local rack node is found, put a random replica
+ location at postion 0.
+ It leaves the rest nodes untouched.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Create a new input stream with the given timeout. If the timeout
+ is zero, it will be treated as infinite timeout. The socket's
+ channel will be configured to be non-blocking.
+
+ @see SocketInputStream#SocketInputStream(ReadableByteChannel, long)
+
+ @param socket should have a channel associated with it.
+ @param timeout timeout timeout in milliseconds. must not be negative.
+ @throws IOException]]>
+
+
+
+
+
+
+
+ Create a new input stream with the given timeout. If the timeout
+ is zero, it will be treated as infinite timeout. The socket's
+ channel will be configured to be non-blocking.
+ @see SocketInputStream#SocketInputStream(ReadableByteChannel, long)
+
+ @param socket should have a channel associated with it.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Create a new ouput stream with the given timeout. If the timeout
+ is zero, it will be treated as infinite timeout. The socket's
+ channel will be configured to be non-blocking.
+
+ @see SocketOutputStream#SocketOutputStream(WritableByteChannel, long)
+
+ @param socket should have a channel associated with it.
+ @param timeout timeout timeout in milliseconds. must not be negative.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ = getCount().
+ @param newCapacity The new capacity in bytes.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Index idx = startVector(...);
+ while (!idx.done()) {
+ .... // read element of a vector
+ idx.incr();
+ }
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This task takes the given record definition files and compiles them into
+ java or c++
+ files. It is then up to the user to compile the generated files.
+
+
The task requires the file or the nested fileset element to be
+ specified. Optional attributes are language (set the output
+ language, default is "java"),
+ destdir (name of the destination directory for generated java/c++
+ code, default is ".") and failonerror (specifies error handling
+ behavior. default is true).
+
]]>
+
+
+
+
+
+
+
+
+
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ (cause==null ? null : cause.toString()) (which
+ typically contains the class and detail message of cause).
+ @param cause the cause (which is saved for later retrieval by the
+ {@link #getCause()} method). (A null value is
+ permitted, and indicates that the cause is nonexistent or
+ unknown.)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ Group with the given groupname.
+ @param group group name]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ugi.
+ @param ugi user
+ @return the {@link Subject} for the user identified by ugi]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ugi as a comma separated string in
+ conf as a property attr
+
+ The String starts with the user name followed by the default group names,
+ and other group names.
+
+ @param conf configuration
+ @param attr property name
+ @param ugi a UnixUserGroupInformation]]>
+
+
+
+
+
+
+
+ conf
+
+ The object is expected to store with the property name attr
+ as a comma separated string that starts
+ with the user name followed by group names.
+ If the property name is not defined, return null.
+ It's assumed that there is only one UGI per user. If this user already
+ has a UGI in the ugi map, return the ugi in the map.
+ Otherwise, construct a UGI from the configuration, store it in the
+ ugi map and return it.
+
+ @param conf configuration
+ @param attr property name
+ @return a UnixUGI
+ @throws LoginException if the stored string is ill-formatted.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ User with the given username.
+ @param user user name]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ (cause==null ? null : cause.toString()) (which
+ typically contains the class and detail message of cause).
+ @param cause the cause (which is saved for later retrieval by the
+ {@link #getCause()} method). (A null value is
+ permitted, and indicates that the cause is nonexistent or
+ unknown.)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ does not provide the stack trace for security purposes.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ service as related to
+ Service Level Authorization for Hadoop.
+
+ Each service defines it's configuration key and also the necessary
+ {@link Permission} required to access the service.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ in]]>
+
+
+
+
+
+
+ out.]]>
+
+
+
+
+
+
+
+
+
+ reset is true, then resets the checksum.
+ @return number of bytes written. Will be equal to getChecksumSize();]]>
+
+
+
+
+
+
+
+
+ reset is true, then resets the checksum.
+ @return number of bytes written. Will be equal to getChecksumSize();]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ GenericOptionsParser to parse only the generic Hadoop
+ arguments.
+
+ The array of string arguments other than the generic arguments can be
+ obtained by {@link #getRemainingArgs()}.
+
+ @param conf the Configuration to modify.
+ @param args command-line arguments.]]>
+
+
+
+
+ GenericOptionsParser to parse given options as well
+ as generic Hadoop options.
+
+ The resulting CommandLine object can be obtained by
+ {@link #getCommandLine()}.
+
+ @param conf the configuration to modify
+ @param options options built by the caller
+ @param args User-specified arguments]]>
+
+
+
+
+ Strings containing the un-parsed arguments
+ or empty array if commandLine was not defined.]]>
+
+
+
+
+
+
+
+
+
+ CommandLine object
+ to process the parsed arguments.
+
+ Note: If the object is created with
+ {@link #GenericOptionsParser(Configuration, String[])}, then returned
+ object will only contain parsed generic options.
+
+ @return CommandLine representing list of arguments
+ parsed against Options descriptor.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ GenericOptionsParser is a utility to parse command line
+ arguments generic to the Hadoop framework.
+
+ GenericOptionsParser recognizes several standarad command
+ line arguments, enabling applications to easily specify a namenode, a
+ jobtracker, additional configuration resources etc.
+
+
Generic Options
+
+
The supported generic options are:
+
+ -conf <configuration file> specify a configuration file
+ -D <property=value> use value for given property
+ -fs <local|namenode:port> specify a namenode
+ -jt <local|jobtracker:port> specify a job tracker
+ -files <comma separated list of files> specify comma separated
+ files to be copied to the map reduce cluster
+ -libjars <comma separated list of jars> specify comma separated
+ jar files to include in the classpath.
+ -archives <comma separated list of archives> specify comma
+ separated archives to be unarchived on the compute machines.
+
+
Generic command line arguments might modify
+ Configuration objects, given to constructors.
+
+
The functionality is implemented using Commons CLI.
+
+
Examples:
+
+ $ bin/hadoop dfs -fs darwin:8020 -ls /data
+ list /data directory in dfs with namenode darwin:8020
+
+ $ bin/hadoop dfs -D fs.default.name=darwin:8020 -ls /data
+ list /data directory in dfs with namenode darwin:8020
+
+ $ bin/hadoop dfs -conf hadoop-site.xml -ls /data
+ list /data directory in dfs with conf specified in hadoop-site.xml
+
+ $ bin/hadoop job -D mapred.job.tracker=darwin:50020 -submit job.xml
+ submit a job to job tracker darwin:50020
+
+ $ bin/hadoop job -jt darwin:50020 -submit job.xml
+ submit a job to job tracker darwin:50020
+
+ $ bin/hadoop job -jt local -submit job.xml
+ submit a job to local runner
+
+ $ bin/hadoop jar -libjars testlib.jar
+ -archives test.tgz -files file.txt inputjar args
+ job submission with libjars, files and archives
+
+
+ @see Tool
+ @see ToolRunner]]>
+
+
+
+
+
+
+
+
+
+
+ Class<T>) of the
+ argument of type T.
+ @param The type of the argument
+ @param t the object to get it class
+ @return Class<T>]]>
+
+
+
+
+
+
+ List<T> to a an array of
+ T[].
+ @param c the Class object of the items in the list
+ @param list the list to convert]]>
+
+
+
+
+
+ List<T> to a an array of
+ T[].
+ @param list the list to convert
+ @throws ArrayIndexOutOfBoundsException if the list is empty.
+ Use {@link #toArray(Class, List)} if the list may be empty.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ io.file.buffer.size specified in the given
+ Configuration.
+ @param in input stream
+ @param conf configuration
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if native-hadoop is loaded,
+ else false]]>
+
+
+
+
+
+ true if native hadoop libraries, if present, can be
+ used for this job; false otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ { pq.top().change(); pq.adjustTop(); }
+ instead of
+ { o = pq.pop(); o.change(); pq.push(o); }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Clients and/or applications can use the provided Progressable
+ to explicitly report progress to the Hadoop framework. This is especially
+ important for operations which take an insignificant amount of time since,
+ in-lieu of the reported progress, the framework has to assume that an error
+ has occured and time-out the operation.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Class is to be obtained
+ @return the correctly typed Class of the given object.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Hadoop Pipes
+ or Hadoop Streaming.
+
+ It also checks to ensure that we are running on a *nix platform else
+ (e.g. in Cygwin/Windows) it returns null.
+ @param conf configuration
+ @return a String[] with the ulimit command arguments or
+ null if we are running on a non *nix platform or
+ if the limit is unspecified.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Shell interface.
+ @param cmd shell command to execute.
+ @return the output of the executed command.]]>
+
+
+
+
+
+
+
+ Shell interface.
+ @param env the map of environment key=value
+ @param cmd shell command to execute.
+ @return the output of the executed command.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Shell can be used to run unix commands like du or
+ df. It also offers facilities to gate commands by
+ time-intervals.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ShellCommandExecutorshould be used in cases where the output
+ of the command needs no explicit parsing and where the command, working
+ directory and the environment remains unchanged. The output of the command
+ is stored as-is and is expected to be small.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ArrayList of string values]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ charToEscape in the string
+ with the escape char escapeChar
+
+ @param str string
+ @param escapeChar escape char
+ @param charToEscape the char to be escaped
+ @return an escaped string]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ charToEscape in the string
+ with the escape char escapeChar
+
+ @param str string
+ @param escapeChar escape char
+ @param charToEscape the escaped char
+ @return an unescaped string]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Tool, is the standard for any Map-Reduce tool/application.
+ The tool/application should delegate the handling of
+
+ standard command-line options to {@link ToolRunner#run(Tool, String[])}
+ and only handle its custom arguments.
+
+
Here is how a typical Tool is implemented:
+
+ public class MyApp extends Configured implements Tool {
+
+ public int run(String[] args) throws Exception {
+ // Configuration processed by ToolRunner
+ Configuration conf = getConf();
+
+ // Create a JobConf using the processed conf
+ JobConf job = new JobConf(conf, MyApp.class);
+
+ // Process custom command-line options
+ Path in = new Path(args[1]);
+ Path out = new Path(args[2]);
+
+ // Specify various job-specific parameters
+ job.setJobName("my-app");
+ job.setInputPath(in);
+ job.setOutputPath(out);
+ job.setMapperClass(MyApp.MyMapper.class);
+ job.setReducerClass(MyApp.MyReducer.class);
+
+ // Submit the job, then poll for progress until the job is complete
+ JobClient.runJob(job);
+ }
+
+ public static void main(String[] args) throws Exception {
+ // Let ToolRunner handle generic command-line options
+ int res = ToolRunner.run(new Configuration(), new Sort(), args);
+
+ System.exit(res);
+ }
+ }
+
+
+ @see GenericOptionsParser
+ @see ToolRunner]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Tool by {@link Tool#run(String[])}, after
+ parsing with the given generic arguments. Uses the given
+ Configuration, or builds one if null.
+
+ Sets the Tool's configuration with the possibly modified
+ version of the conf.
+
+ @param conf Configuration for the Tool.
+ @param tool Tool to run.
+ @param args command-line arguments to the tool.
+ @return exit code of the {@link Tool#run(String[])} method.]]>
+
+
+
+
+
+
+
+ Tool with its Configuration.
+
+ Equivalent to run(tool.getConf(), tool, args).
+
+ @param tool Tool to run.
+ @param args command-line arguments to the tool.
+ @return exit code of the {@link Tool#run(String[])} method.]]>
+
+
+
+
+
+
+
+
+
+ ToolRunner can be used to run classes implementing
+ Tool interface. It works in conjunction with
+ {@link GenericOptionsParser} to parse the
+
+ generic hadoop command line arguments and modifies the
+ Configuration of the Tool. The
+ application-specific options are passed along without being modified.
+
+
+ @see Tool
+ @see GenericOptionsParser]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ this filter.
+ @param nbHash The number of hash function to consider.
+ @param hashType type of the hashing function (see
+ {@link org.apache.hadoop.util.hash.Hash}).]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Bloom filter, as defined by Bloom in 1970.
+
+ The Bloom filter is a data structure that was introduced in 1970 and that has been adopted by
+ the networking research community in the past decade thanks to the bandwidth efficiencies that it
+ offers for the transmission of set membership information between networked hosts. A sender encodes
+ the information into a bit vector, the Bloom filter, that is more compact than a conventional
+ representation. Computation and space costs for construction are linear in the number of elements.
+ The receiver uses the filter to test whether various elements are members of the set. Though the
+ filter will occasionally return a false positive, it will never return a false negative. When creating
+ the filter, the sender can choose its desired point in a trade-off between the false positive rate and the size.
+
+
+
+
+
+
+
+
+
+
+
+
+
+ this filter.
+ @param nbHash The number of hash function to consider.
+ @param hashType type of the hashing function (see
+ {@link org.apache.hadoop.util.hash.Hash}).]]>
+
+
+
+
+
+
+
+
+ this counting Bloom filter.
+
+ Invariant: nothing happens if the specified key does not belong to this counter Bloom filter.
+ @param key The key to remove.]]>
+
+
+
+
+
+
+
+
+
+
+
+ key -> count map.
+
NOTE: due to the bucket size of this filter, inserting the same
+ key more than 15 times will cause an overflow at all filter positions
+ associated with this key, and it will significantly increase the error
+ rate for this and other keys. For this reason the filter can only be
+ used to store small count values 0 <= N << 15.
+ @param key key to be tested
+ @return 0 if the key is not present. Otherwise, a positive value v will
+ be returned such that v == count with probability equal to the
+ error rate of this filter, and v > count otherwise.
+ Additionally, if the filter experienced an underflow as a result of
+ {@link #delete(Key)} operation, the return value may be lower than the
+ count with the probability of the false negative rate of such
+ filter.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ counting Bloom filter, as defined by Fan et al. in a ToN
+ 2000 paper.
+
+ A counting Bloom filter is an improvement to standard a Bloom filter as it
+ allows dynamic additions and deletions of set membership information. This
+ is achieved through the use of a counting vector instead of a bit vector.
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Builds an empty Dynamic Bloom filter.
+ @param vectorSize The number of bits in the vector.
+ @param nbHash The number of hash function to consider.
+ @param hashType type of the hashing function (see
+ {@link org.apache.hadoop.util.hash.Hash}).
+ @param nr The threshold for the maximum number of keys to record in a
+ dynamic Bloom filter row.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ dynamic Bloom filter, as defined in the INFOCOM 2006 paper.
+
+ A dynamic Bloom filter (DBF) makes use of a s * m bit matrix but
+ each of the s rows is a standard Bloom filter. The creation
+ process of a DBF is iterative. At the start, the DBF is a 1 * m
+ bit matrix, i.e., it is composed of a single standard Bloom filter.
+ It assumes that nr elements are recorded in the
+ initial bit vector, where nr <= n (n is
+ the cardinality of the set A to record in the filter).
+
+ As the size of A grows during the execution of the application,
+ several keys must be inserted in the DBF. When inserting a key into the DBF,
+ one must first get an active Bloom filter in the matrix. A Bloom filter is
+ active when the number of recorded keys, nr, is
+ strictly less than the current cardinality of A, n.
+ If an active Bloom filter is found, the key is inserted and
+ nr is incremented by one. On the other hand, if there
+ is no active Bloom filter, a new one is created (i.e., a new row is added to
+ the matrix) according to the current size of A and the element
+ is added in this new Bloom filter and the nr value of
+ this new Bloom filter is set to one. A given key is said to belong to the
+ DBF if the k positions are set to one in one of the matrix rows.
+
+
+
+
+
+
+
+
+
+
+ this filter.
+ @param nbHash The number of hash functions to consider.
+ @param hashType type of the hashing function (see {@link Hash}).]]>
+
+
+
+
+
+ this filter.
+ @param key The key to add.]]>
+
+
+
+
+
+ this filter.
+ @param key The key to test.
+ @return boolean True if the specified key belongs to this filter.
+ False otherwise.]]>
+
+
+
+
+
+ this filter and a specified filter.
+
+ Invariant: The result is assigned to this filter.
+ @param filter The filter to AND with.]]>
+
+
+
+
+
+ this filter and a specified filter.
+
+ Invariant: The result is assigned to this filter.
+ @param filter The filter to OR with.]]>
+
+
+
+
+
+ this filter and a specified filter.
+
+ Invariant: The result is assigned to this filter.
+ @param filter The filter to XOR with.]]>
+
+
+
+
+ this filter.
+
+ The result is assigned to this filter.]]>
+
+
+
+
+
+ this filter.
+ @param keys The list of keys.]]>
+
+
+
+
+
+ this filter.
+ @param keys The collection of keys.]]>
+
+
+
+
+
+ this filter.
+ @param keys The array of keys.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ this filter.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A filter is a data structure which aims at offering a lossy summary of a set A. The
+ key idea is to map entries of A (also called keys) into several positions
+ in a vector through the use of several hash functions.
+
+ Typically, a filter will be implemented as a Bloom filter (or a Bloom filter extension).
+
+ It must be extended in order to define the real behavior.
+
+ @see Key The general behavior of a key
+ @see HashFunction A hash function]]>
+
+
+
+
+
+
+
+
+ Builds a hash function that must obey to a given maximum number of returned values and a highest value.
+ @param maxValue The maximum highest returned value.
+ @param nbHash The number of resulting hashed values.
+ @param hashType type of the hashing function (see {@link Hash}).]]>
+
+
+
+
+ this hash function. A NOOP]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Builds a key with a default weight.
+ @param value The byte value of this key.]]>
+
+
+
+
+
+ Builds a key with a specified weight.
+ @param value The value of this key.
+ @param weight The weight associated to this key.]]>
+
+
+
+
+
+
+
+
+
+
+
+ this key.]]>
+
+
+
+
+ this key.]]>
+
+
+
+
+
+ this key with a specified value.
+ @param weight The increment.]]>
+
+
+
+
+ this key by one.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The idea is to randomly select a bit to reset.]]>
+
+
+
+
+
+ The idea is to select the bit to reset that will generate the minimum
+ number of false negative.]]>
+
+
+
+
+
+ The idea is to select the bit to reset that will remove the maximum number
+ of false positive.]]>
+
+
+
+
+
+ The idea is to select the bit to reset that will, at the same time, remove
+ the maximum number of false positve while minimizing the amount of false
+ negative generated.]]>
+
+
+
+
+ Originally created by
+ European Commission One-Lab Project 034819.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ this filter.
+ @param nbHash The number of hash function to consider.
+ @param hashType type of the hashing function (see
+ {@link org.apache.hadoop.util.hash.Hash}).]]>
+
+
+
+
+
+
+
+
+ this retouched Bloom filter.
+
+ Invariant: if the false positive is null, nothing happens.
+ @param key The false positive key to add.]]>
+
+
+
+
+
+ this retouched Bloom filter.
+ @param coll The collection of false positive.]]>
+
+
+
+
+
+ this retouched Bloom filter.
+ @param keys The list of false positive.]]>
+
+
+
+
+
+ this retouched Bloom filter.
+ @param keys The array of false positive.]]>
+
+
+
+
+
+
+ this retouched Bloom filter.
+ @param scheme The selective clearing scheme to apply.]]>
+
+
+
+
+
+
+
+
+
+
+
+ retouched Bloom filter, as defined in the CoNEXT 2006 paper.
+
+ It allows the removal of selected false positives at the cost of introducing
+ random false negatives, and with the benefit of eliminating some random false
+ positives at the same time.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ length, and
+ the provided seed value
+ @param bytes input bytes
+ @param length length of the valid bytes to consider
+ @param initval seed value
+ @return hash value]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The best hash table sizes are powers of 2. There is no need to do mod
+ a prime (mod is sooo slow!). If you need less than 32 bits, use a bitmask.
+ For example, if you need only 10 bits, do
+ h = (h & hashmask(10));
+ In which case, the hash table should have hashsize(10) elements.
+
+
If you are hashing n strings byte[][] k, do it like this:
+ for (int i = 0, h = 0; i < n; ++i) h = hash( k[i], h);
+
+
By Bob Jenkins, 2006. bob_jenkins@burtleburtle.net. You may use this
+ code any way you wish, private, educational, or commercial. It's free.
+
+
Use for hash table lookup, or anything where one collision in 2^^32 is
+ acceptable. Do NOT use for cryptographic purposes.]]>
+
+
+
+
+
+
+
+
+
+
+ lookup3.c, by Bob Jenkins, May 2006, Public Domain.
+
+ You can use this free for any purpose. It's in the public domain.
+ It has no warranty.
+
+
+ @see lookup3.c
+ @see Hash Functions (and how this
+ function compares to others such as CRC, MD?, etc
+ @see Has update on the
+ Dr. Dobbs Article]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The C version of MurmurHash 2.0 found at that site was ported
+ to Java by Andrzej Bialecki (ab at getopt org).]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ JobTracker,
+ as {@link JobTracker.State}
+
+ @return the current state of the JobTracker.]]>
+
+
+
+
+ JobTracker
+
+ @return the size of heap memory used by the JobTracker]]>
+
+
+
+
+ JobTracker
+
+ @return the configured size of max heap memory that can be used by the JobTracker]]>
+
+
+
+
+
+
+
+
+
+
+
+ ClusterStatus provides clients with information such as:
+
+
+ Size of the cluster.
+
+
+ Name of the trackers.
+
+
+ Task capacity of the cluster.
+
+
+ The number of currently running map & reduce tasks.
+
+
+ State of the JobTracker.
+
+
+
+
Clients can query for the latest ClusterStatus, via
+ {@link JobClient#getClusterStatus()}.
Counters are bunched into {@link Group}s, each comprising of
+ counters from a particular Enum class.
+ @deprecated Use {@link org.apache.hadoop.mapreduce.Counters} instead.]]>
+
Grouphandles localization of the class name and the
+ counter names.
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FileInputFormat implementations can override this and return
+ false to ensure that individual input files are never split-up
+ so that {@link Mapper}s process entire files.
+
+ @param fs the file system that the file is on
+ @param filename the file name to check
+ @return is this file splitable?]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FileInputFormat is the base class for all file-based
+ InputFormats. This provides a generic implementation of
+ {@link #getSplits(JobConf, int)}.
+ Subclasses of FileInputFormat can also override the
+ {@link #isSplitable(FileSystem, Path)} method to ensure input-files are
+ not split-up and are processed as a whole by {@link Mapper}s.
+ @deprecated Use {@link org.apache.hadoop.mapreduce.lib.input.FileInputFormat}
+ instead.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the job output should be compressed,
+ false otherwise]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Tasks' Side-Effect Files
+
+
Note: The following is valid only if the {@link OutputCommitter}
+ is {@link FileOutputCommitter}. If OutputCommitter is not
+ a FileOutputCommitter, the task's temporary output
+ directory is same as {@link #getOutputPath(JobConf)} i.e.
+ ${mapred.output.dir}$
+
+
Some applications need to create/write-to side-files, which differ from
+ the actual job-outputs.
+
+
In such cases there could be issues with 2 instances of the same TIP
+ (running simultaneously e.g. speculative tasks) trying to open/write-to the
+ same file (path) on HDFS. Hence the application-writer will have to pick
+ unique names per task-attempt (e.g. using the attemptid, say
+ attempt_200709221812_0001_m_000000_0), not just per TIP.
+
+
To get around this the Map-Reduce framework helps the application-writer
+ out by maintaining a special
+ ${mapred.output.dir}/_temporary/_${taskid}
+ sub-directory for each task-attempt on HDFS where the output of the
+ task-attempt goes. On successful completion of the task-attempt the files
+ in the ${mapred.output.dir}/_temporary/_${taskid} (only)
+ are promoted to ${mapred.output.dir}. Of course, the
+ framework discards the sub-directory of unsuccessful task-attempts. This
+ is completely transparent to the application.
+
+
The application-writer can take advantage of this by creating any
+ side-files required in ${mapred.work.output.dir} during execution
+ of his reduce-task i.e. via {@link #getWorkOutputPath(JobConf)}, and the
+ framework will move them out similarly - thus she doesn't have to pick
+ unique paths per task-attempt.
+
+
Note: the value of ${mapred.work.output.dir} during
+ execution of a particular task-attempt is actually
+ ${mapred.output.dir}/_temporary/_{$taskid}, and this value is
+ set by the map-reduce framework. So, just create any side-files in the
+ path returned by {@link #getWorkOutputPath(JobConf)} from map/reduce
+ task to take advantage of this feature.
+
+
The entire discussion holds true for maps of jobs with
+ reducer=NONE (i.e. 0 reduces) since output of the map, in that case,
+ goes directly to HDFS.
+
+ @return the {@link Path} to the task's temporary output directory
+ for the map-reduce job.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The generated name can be used to create custom files from within the
+ different tasks for the job, the names for different tasks will not collide
+ with each other.
+
+
The given name is postfixed with the task type, 'm' for maps, 'r' for
+ reduces and the task partition number. For example, give a name 'test'
+ running on the first map o the job the generated name will be
+ 'test-m-00000'.
+
+ @param conf the configuration for the job.
+ @param name the name to make unique.
+ @return a unique name accross all tasks of the job.]]>
+
+
+
+
+
+
+ The path can be used to create custom files from within the map and
+ reduce tasks. The path name will be unique for each task. The path parent
+ will be the job output directory.ls
+
+
This method uses the {@link #getUniqueName} method to make the file name
+ unique for the task.
+
+ @param conf the configuration for the job.
+ @param name the name for the file.
+ @return a unique path accross all tasks of the job.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Each {@link InputSplit} is then assigned to an individual {@link Mapper}
+ for processing.
+
+
Note: The split is a logical split of the inputs and the
+ input files are not physically split into chunks. For e.g. a split could
+ be <input-file-path, start, offset> tuple.
+
+ @param job job configuration.
+ @param numSplits the desired number of splits, a hint.
+ @return an array of {@link InputSplit}s for the job.]]>
+
+
+
+
+
+
+
+
+ It is the responsibility of the RecordReader to respect
+ record boundaries while processing the logical split to present a
+ record-oriented view to the individual task.
+
+ @param split the {@link InputSplit}
+ @param job the job that this split belongs to
+ @return a {@link RecordReader}]]>
+
+
+
+ InputFormat describes the input-specification for a
+ Map-Reduce job.
+
+
The Map-Reduce framework relies on the InputFormat of the
+ job to:
+
+
+ Validate the input-specification of the job.
+
+ Split-up the input file(s) into logical {@link InputSplit}s, each of
+ which is then assigned to an individual {@link Mapper}.
+
+
+ Provide the {@link RecordReader} implementation to be used to glean
+ input records from the logical InputSplit for processing by
+ the {@link Mapper}.
+
+
+
+
The default behavior of file-based {@link InputFormat}s, typically
+ sub-classes of {@link FileInputFormat}, is to split the
+ input into logical {@link InputSplit}s based on the total size, in
+ bytes, of the input files. However, the {@link FileSystem} blocksize of
+ the input files is treated as an upper bound for input splits. A lower bound
+ on the split size can be set via
+
+ mapred.min.split.size.
+
+
Clearly, logical splits based on input-size is insufficient for many
+ applications since record boundaries are to respected. In such cases, the
+ application has to also implement a {@link RecordReader} on whom lies the
+ responsibilty to respect record-boundaries and present a record-oriented
+ view of the logical InputSplit to the individual task.
+
+ @see InputSplit
+ @see RecordReader
+ @see JobClient
+ @see FileInputFormat
+ @deprecated Use {@link org.apache.hadoop.mapreduce.InputFormat} instead.]]>
+
+
+
+
+
+
+
+
+
+ InputSplit.
+
+ @return the number of bytes in the input split.
+ @throws IOException]]>
+
+
+
+
+
+ InputSplit is
+ located as an array of Strings.
+ @throws IOException]]>
+
+
+
+ InputSplit represents the data to be processed by an
+ individual {@link Mapper}.
+
+
Typically, it presents a byte-oriented view on the input and is the
+ responsibility of {@link RecordReader} of the job to process this and present
+ a record-oriented view.
+
+ @see InputFormat
+ @see RecordReader
+ @deprecated Use {@link org.apache.hadoop.mapreduce.InputSplit} instead.]]>
+
+ Checking the input and output specifications of the job.
+
+
+ Computing the {@link InputSplit}s for the job.
+
+
+ Setup the requisite accounting information for the {@link DistributedCache}
+ of the job, if necessary.
+
+
+ Copying the job's jar and configuration to the map-reduce system directory
+ on the distributed file-system.
+
+
+ Submitting the job to the JobTracker and optionally monitoring
+ it's status.
+
+
+
+ Normally the user creates the application, describes various facets of the
+ job via {@link JobConf} and then uses the JobClient to submit
+ the job and monitor its progress.
+
+
Here is an example on how to use JobClient:
+
+ // Create a new JobConf
+ JobConf job = new JobConf(new Configuration(), MyJob.class);
+
+ // Specify various job-specific parameters
+ job.setJobName("myjob");
+
+ job.setInputPath(new Path("in"));
+ job.setOutputPath(new Path("out"));
+
+ job.setMapperClass(MyJob.MyMapper.class);
+ job.setReducerClass(MyJob.MyReducer.class);
+
+ // Submit the job, then poll for progress until the job is complete
+ JobClient.runJob(job);
+
+
+
Job Control
+
+
At times clients would chain map-reduce jobs to accomplish complex tasks
+ which cannot be done via a single map-reduce job. This is fairly easy since
+ the output of the job, typically, goes to distributed file-system and that
+ can be used as the input for the next job.
+
+
However, this also means that the onus on ensuring jobs are complete
+ (success/failure) lies squarely on the clients. In such situations the
+ various job-control options are:
+
+
+ {@link #runJob(JobConf)} : submits the job and returns only after
+ the job has completed.
+
+
+ {@link #submitJob(JobConf)} : only submits the job, then poll the
+ returned handle to the {@link RunningJob} to query status and make
+ scheduling decisions.
+
+
+ {@link JobConf#setJobEndNotificationURI(String)} : setup a notification
+ on job-completion, thus avoiding polling.
+
+
+
+ @see JobConf
+ @see ClusterStatus
+ @see Tool
+ @see DistributedCache]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ If the parameter {@code loadDefaults} is false, the new instance
+ will not load resources from the default files.
+
+ @param loadDefaults specifies whether to load from the default files]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if framework should keep the intermediate files
+ for failed tasks, false otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the outputs of the maps are to be compressed,
+ false otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This comparator should be provided if the equivalence rules for keys
+ for sorting the intermediates are different from those for grouping keys
+ before each call to
+ {@link Reducer#reduce(Object, java.util.Iterator, OutputCollector, Reporter)}.
+
+
For key-value pairs (K1,V1) and (K2,V2), the values (V1, V2) are passed
+ in a single call to the reduce function if K1 and K2 compare as equal.
+
+
Since {@link #setOutputKeyComparatorClass(Class)} can be used to control
+ how keys are sorted, this can be used in conjunction to simulate
+ secondary sort on values.
+
+
Note: This is not a guarantee of the reduce sort being
+ stable in any sense. (In any case, with the order of available
+ map-outputs to the reduce being non-deterministic, it wouldn't make
+ that much sense.)
+
+ @param theClass the comparator class to be used for grouping keys.
+ It should implement RawComparator.
+ @see #setOutputKeyComparatorClass(Class)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ combiner class used to combine map-outputs
+ before being sent to the reducers. Typically the combiner is same as the
+ the {@link Reducer} for the job i.e. {@link #getReducerClass()}.
+
+ @return the user-defined combiner class used to combine map-outputs.]]>
+
+
+
+
+
+ combiner class used to combine map-outputs
+ before being sent to the reducers.
+
+
The combiner is an application-specified aggregation operation, which
+ can help cut down the amount of data transferred between the
+ {@link Mapper} and the {@link Reducer}, leading to better performance.
+
+
The framework may invoke the combiner 0, 1, or multiple times, in both
+ the mapper and reducer tasks. In general, the combiner is called as the
+ sort/merge result is written to disk. The combiner must:
+
+
be side-effect free
+
have the same input and output key types and the same input and
+ output value types
+
+
+
Typically the combiner is same as the Reducer for the
+ job i.e. {@link #setReducerClass(Class)}.
+
+ @param theClass the user-defined combiner class used to combine
+ map-outputs.]]>
+
+
+
+
+ true.
+
+ @return true if speculative execution be used for this job,
+ false otherwise.]]>
+
+
+
+
+
+ true if speculative execution
+ should be turned on, else false.]]>
+
+
+
+
+ true.
+
+ @return true if speculative execution be
+ used for this job for map tasks,
+ false otherwise.]]>
+
+
+
+
+
+ true if speculative execution
+ should be turned on for map tasks,
+ else false.]]>
+
+
+
+
+ true.
+
+ @return true if speculative execution be used
+ for reduce tasks for this job,
+ false otherwise.]]>
+
+
+
+
+
+ true if speculative execution
+ should be turned on for reduce tasks,
+ else false.]]>
+
+
+
+
+ 1.
+
+ @return the number of reduce tasks for this job.]]>
+
+
+
+
+
+ Note: This is only a hint to the framework. The actual
+ number of spawned map tasks depends on the number of {@link InputSplit}s
+ generated by the job's {@link InputFormat#getSplits(JobConf, int)}.
+
+ A custom {@link InputFormat} is typically used to accurately control
+ the number of map tasks for the job.
+
+
How many maps?
+
+
The number of maps is usually driven by the total size of the inputs
+ i.e. total number of blocks of the input files.
+
+
The right level of parallelism for maps seems to be around 10-100 maps
+ per-node, although it has been set up to 300 or so for very cpu-light map
+ tasks. Task setup takes awhile, so it is best if the maps take at least a
+ minute to execute.
+
+
The default behavior of file-based {@link InputFormat}s is to split the
+ input into logical {@link InputSplit}s based on the total size, in
+ bytes, of input files. However, the {@link FileSystem} blocksize of the
+ input files is treated as an upper bound for input splits. A lower bound
+ on the split size can be set via
+
+ mapred.min.split.size.
+
+
Thus, if you expect 10TB of input data and have a blocksize of 128MB,
+ you'll end up with 82,000 maps, unless {@link #setNumMapTasks(int)} is
+ used to set it even higher.
+
+ @param n the number of map tasks for this job.
+ @see InputFormat#getSplits(JobConf, int)
+ @see FileInputFormat
+ @see FileSystem#getDefaultBlockSize()
+ @see FileStatus#getBlockSize()]]>
+
+
+
+
+ 1.
+
+ @return the number of reduce tasks for this job.]]>
+
+
+
+
+
+ How many reduces?
+
+
With 0.95 all of the reduces can launch immediately and
+ start transfering map outputs as the maps finish. With 1.75
+ the faster nodes will finish their first round of reduces and launch a
+ second wave of reduces doing a much better job of load balancing.
+
+
Increasing the number of reduces increases the framework overhead, but
+ increases load balancing and lowers the cost of failures.
+
+
The scaling factors above are slightly less than whole numbers to
+ reserve a few reduce slots in the framework for speculative-tasks, failures
+ etc.
+
+
Reducer NONE
+
+
It is legal to set the number of reduce-tasks to zero.
+
+
In this case the output of the map-tasks directly go to distributed
+ file-system, to the path set by
+ {@link FileOutputFormat#setOutputPath(JobConf, Path)}. Also, the
+ framework doesn't sort the map-outputs before writing it out to HDFS.
+
+ @param n the number of reduce tasks for this job.]]>
+
+
+
+
+ mapred.map.max.attempts
+ property. If this property is not already set, the default is 4 attempts.
+
+ @return the max number of attempts per map task.]]>
+
+
+
+
+
+
+
+
+
+
+ mapred.reduce.max.attempts
+ property. If this property is not already set, the default is 4 attempts.
+
+ @return the max number of attempts per reduce task.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ noFailures, the
+ tasktracker is blacklisted for this job.
+
+ @param noFailures maximum no. of failures of a given job per tasktracker.]]>
+
+
+
+
+ blacklisted for this job.
+
+ @return the maximum no. of failures of a given job per tasktracker.]]>
+
+
+
+
+ failed.
+
+ Defaults to zero, i.e. any failed map-task results in
+ the job being declared as {@link JobStatus#FAILED}.
+
+ @return the maximum percentage of map tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+
+ failed.
+
+ @param percent the maximum percentage of map tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+ failed.
+
+ Defaults to zero, i.e. any failed reduce-task results
+ in the job being declared as {@link JobStatus#FAILED}.
+
+ @return the maximum percentage of reduce tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+
+ failed.
+
+ @param percent the maximum percentage of reduce tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The debug script can aid debugging of failed map tasks. The script is
+ given task's stdout, stderr, syslog, jobconf files as arguments.
+
+
The debug command, run on the node where the map failed, is:
+
+ $script $stdout $stderr $syslog $jobconf.
+
+
+
The script file is distributed through {@link DistributedCache}
+ APIs. The script needs to be symlinked.
+
+ @param mDbgScript the script name]]>
+
+
+
+
+
+
+
+
+
+
+ The debug script can aid debugging of failed reduce tasks. The script
+ is given task's stdout, stderr, syslog, jobconf files as arguments.
+
+
The debug command, run on the node where the map failed, is:
+
+ $script $stdout $stderr $syslog $jobconf.
+
+
+
The script file is distributed through {@link DistributedCache}
+ APIs. The script file needs to be symlinked
+
+ @param rDbgScript the script name]]>
+
+
+
+
+
+
+
+
+
+ null if it hasn't
+ been set.
+ @see #setJobEndNotificationURI(String)]]>
+
+
+
+
+
+ The uri can contain 2 special parameters: $jobId and
+ $jobStatus. Those, if present, are replaced by the job's
+ identifier and completion-status respectively.
+
+
This is typically used by application-writers to implement chaining of
+ Map-Reduce jobs in an asynchronous manner.
+
+ @param uri the job end notification uri
+ @see JobStatus
+ @see Job Completion and Chaining]]>
+
+
+
+
+
+ When a job starts, a shared directory is created at location
+
+ ${mapred.local.dir}/taskTracker/jobcache/$jobid/work/ .
+ This directory is exposed to the users through
+ job.local.dir .
+ So, the tasks can use this space
+ as scratch space and share files among them.
+ This value is available as System property also.
+
+ @return The localized job specific shared directory]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ If a job doesn't specify its virtual memory requirement by setting
+ {@link #MAPRED_TASK_MAXVMEM_PROPERTY} to {@link #DISABLED_MEMORY_LIMIT},
+ tasks are assured a memory limit set to this property. This property is
+ disabled by default, and if not explicitly set to a valid value by the
+ administrators and if a job doesn't specify its virtual memory
+ requirements, the job's tasks will not be assured anything and may be
+ killed by a TT that intends to control the total memory usage of the tasks
+ via memory management functionality.
+
+
+
+ This value should in general be less than the cluster-wide configuration
+ {@link #UPPER_LIMIT_ON_TASK_VMEM_PROPERTY} . If not or if it not set,
+ TaskTracker's memory management may be disabled and a scheduler's memory
+ based scheduling decisions will be affected. Please refer to the
+ documentation of the configured scheduler to see how this property is used.]]>
+
+
+
+
+
+
+ This value will be used by TaskTrackers for monitoring the memory usage of
+ tasks of this jobs. If a TaskTracker's memory management functionality is
+ enabled, each task of this job will be allowed to use a maximum virtual
+ memory specified by this property. If the task's memory usage goes over
+ this value, the task will be failed by the TT. If not set, the cluster-wide
+ configuration {@link #MAPRED_TASK_DEFAULT_MAXVMEM_PROPERTY} is used as the
+ default value for memory requirements. If this property cascaded with
+ {@link #MAPRED_TASK_DEFAULT_MAXVMEM_PROPERTY} becomes equal to -1, job's
+ tasks will not be assured anything and may be killed by a TT that intends
+ to control the total memory usage of the tasks via memory management
+ functionality. If the memory management functionality is disabled on a TT,
+ this value is ignored.
+
+
+
+ This value should also be not more than the cluster-wide configuration
+ {@link #UPPER_LIMIT_ON_TASK_VMEM_PROPERTY} which has to be set by the site
+ administrators.
+
+
+
+ This value may be used by schedulers that support scheduling based on job's
+ memory requirements. In general, a task of this job will be scheduled on a
+ TaskTracker only if the amount of virtual memory still unoccupied on the
+ TaskTracker is greater than or equal to this value. But different
+ schedulers can take different decisions. Please refer to the documentation
+ of the scheduler being configured to see if it does memory based scheduling
+ and if it does, how this property is used by that scheduler.
+
+ @see #setMaxVirtualMemoryForTask(long)
+ @see #getMaxVirtualMemoryForTask()]]>
+
+
+
+
+
+
+ This value may be used by schedulers that support scheduling based on job's
+ memory requirements. In general, a task of this job will be scheduled on a
+ TaskTracker, only if the amount of physical memory still unoccupied on the
+ TaskTracker is greater than or equal to this value. But different
+ schedulers can take different decisions. Please refer to the documentation
+ of the scheduler being configured to see how it does memory based
+ scheduling and how this variable is used by that scheduler.
+
+ @see #setMaxPhysicalMemoryForTask(long)
+ @see #getMaxPhysicalMemoryForTask()]]>
+
+
+
+
+
+
+ If it is not set on a TaskTracker, TaskTracker's memory management will be
+ disabled.]]>
+
+
+
+ JobConf is the primary interface for a user to describe a
+ map-reduce job to the Hadoop framework for execution. The framework tries to
+ faithfully execute the job as-is described by JobConf, however:
+
+
+ Some configuration parameters might have been marked as
+
+ final by administrators and hence cannot be altered.
+
+
+ While some job parameters are straight-forward to set
+ (e.g. {@link #setNumReduceTasks(int)}), some parameters interact subtly
+ rest of the framework and/or job-configuration and is relatively more
+ complex for the user to control finely (e.g. {@link #setNumMapTasks(int)}).
+
+
+
+
JobConf typically specifies the {@link Mapper}, combiner
+ (if any), {@link Partitioner}, {@link Reducer}, {@link InputFormat} and
+ {@link OutputFormat} implementations to be used etc.
+
+
Optionally JobConf is used to specify other advanced facets
+ of the job such as Comparators to be used, files to be put in
+ the {@link DistributedCache}, whether or not intermediate and/or job outputs
+ are to be compressed (and how), debugability via user-provided scripts
+ ( {@link #setMapDebugScript(String)}/{@link #setReduceDebugScript(String)}),
+ for doing post-processing on task logs, task's stdout, stderr, syslog.
+ and etc.
+
+
Here is an example on how to configure a job via JobConf:
+
+ // Create a new JobConf
+ JobConf job = new JobConf(new Configuration(), MyJob.class);
+
+ // Specify various job-specific parameters
+ job.setJobName("myjob");
+
+ FileInputFormat.setInputPaths(job, new Path("in"));
+ FileOutputFormat.setOutputPath(job, new Path("out"));
+
+ job.setMapperClass(MyJob.MyMapper.class);
+ job.setCombinerClass(MyJob.MyReducer.class);
+ job.setReducerClass(MyJob.MyReducer.class);
+
+ job.setInputFormat(SequenceFileInputFormat.class);
+ job.setOutputFormat(SequenceFileOutputFormat.class);
+
+ @param jtIdentifier jobTracker identifier, or null
+ @param jobId job number, or null
+ @return a regex pattern matching JobIDs]]>
+
+
+
+
+ An example JobID is :
+ job_200707121733_0003 , which represents the third job
+ running at the jobtracker started at 200707121733.
+
+ Applications should never construct or parse JobID strings, but rather
+ use appropriate constructors or {@link #forName(String)} method.
+
+ @see TaskID
+ @see TaskAttemptID]]>
+
Applications can use the {@link Reporter} provided to report progress
+ or just indicate that they are alive. In scenarios where the application
+ takes an insignificant amount of time to process individual key/value
+ pairs, this is crucial since the framework might assume that the task has
+ timed-out and kill that task. The other way of avoiding this is to set
+
+ mapred.task.timeout to a high-enough value (or even zero for no
+ time-outs).
+
+ @param key the input key.
+ @param value the input value.
+ @param output collects mapped keys and values.
+ @param reporter facility to report progress.]]>
+
+
+
+ Maps are the individual tasks which transform input records into a
+ intermediate records. The transformed intermediate records need not be of
+ the same type as the input records. A given input pair may map to zero or
+ many output pairs.
+
+
The Hadoop Map-Reduce framework spawns one map task for each
+ {@link InputSplit} generated by the {@link InputFormat} for the job.
+ Mapper implementations can access the {@link JobConf} for the
+ job via the {@link JobConfigurable#configure(JobConf)} and initialize
+ themselves. Similarly they can use the {@link Closeable#close()} method for
+ de-initialization.
+
+
The framework then calls
+ {@link #map(Object, Object, OutputCollector, Reporter)}
+ for each key/value pair in the InputSplit for that task.
+
+
All intermediate values associated with a given output key are
+ subsequently grouped by the framework, and passed to a {@link Reducer} to
+ determine the final output. Users can control the grouping by specifying
+ a Comparator via
+ {@link JobConf#setOutputKeyComparatorClass(Class)}.
+
+
The grouped Mapper outputs are partitioned per
+ Reducer. Users can control which keys (and hence records) go to
+ which Reducer by implementing a custom {@link Partitioner}.
+
+
Users can optionally specify a combiner, via
+ {@link JobConf#setCombinerClass(Class)}, to perform local aggregation of the
+ intermediate outputs, which helps to cut down the amount of data transferred
+ from the Mapper to the Reducer.
+
+
The intermediate, grouped outputs are always stored in
+ {@link SequenceFile}s. Applications can specify if and how the intermediate
+ outputs are to be compressed and which {@link CompressionCodec}s are to be
+ used via the JobConf.
+
+
If the job has
+ zero
+ reduces then the output of the Mapper is directly written
+ to the {@link FileSystem} without grouping by keys.
+
+
Example:
+
+ public class MyMapper<K extends WritableComparable, V extends Writable>
+ extends MapReduceBase implements Mapper<K, V, K, V> {
+
+ static enum MyCounters { NUM_RECORDS }
+
+ private String mapTaskId;
+ private String inputFile;
+ private int noRecords = 0;
+
+ public void configure(JobConf job) {
+ mapTaskId = job.get("mapred.task.id");
+ inputFile = job.get("map.input.file");
+ }
+
+ public void map(K key, V val,
+ OutputCollector<K, V> output, Reporter reporter)
+ throws IOException {
+ // Process the <key, value> pair (assume this takes a while)
+ // ...
+ // ...
+
+ // Let the framework know that we are alive, and kicking!
+ // reporter.progress();
+
+ // Process some more
+ // ...
+ // ...
+
+ // Increment the no. of <key, value> pairs processed
+ ++noRecords;
+
+ // Increment counters
+ reporter.incrCounter(NUM_RECORDS, 1);
+
+ // Every 100 records update application-level status
+ if ((noRecords%100) == 0) {
+ reporter.setStatus(mapTaskId + " processed " + noRecords +
+ " from input-file: " + inputFile);
+ }
+
+ // Output the result
+ output.collect(key, val);
+ }
+ }
+
+
+
Applications may write a custom {@link MapRunnable} to exert greater
+ control on map processing e.g. multi-threaded Mappers etc.
Mapping of input records to output records is complete when this method
+ returns.
+
+ @param input the {@link RecordReader} to read the input records.
+ @param output the {@link OutputCollector} to collect the outputrecords.
+ @param reporter {@link Reporter} to report progress, status-updates etc.
+ @throws IOException]]>
+
+
+
+ Custom implementations of MapRunnable can exert greater
+ control on map processing e.g. multi-threaded, asynchronous mappers etc.
+
+ @see Mapper
+ @deprecated Use {@link org.apache.hadoop.mapreduce.Mapper} instead.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ nearly
+ equal content length.
+ Subclasses implement {@link #getRecordReader(InputSplit, JobConf, Reporter)}
+ to construct RecordReader's for MultiFileSplit's.
+ @see MultiFileSplit
+ @deprecated Use {@link org.apache.hadoop.mapred.lib.CombineFileInputFormat} instead]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ MultiFileSplit can be used to implement {@link RecordReader}'s, with
+ reading one record per file.
+ @see FileSplit
+ @see MultiFileInputFormat
+ @deprecated Use {@link org.apache.hadoop.mapred.lib.CombineFileSplit} instead]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ <key, value> pairs output by {@link Mapper}s
+ and {@link Reducer}s.
+
+
OutputCollector is the generalization of the facility
+ provided by the Map-Reduce framework to collect data output by either the
+ Mapper or the Reducer i.e. intermediate outputs
+ or the output of the job.
The Map-Reduce framework relies on the OutputCommitter of
+ the job to:
+
+
+ Setup the job during initialization. For example, create the temporary
+ output directory for the job during the initialization of the job.
+
+
+ Cleanup the job after the job completion. For example, remove the
+ temporary output directory after the job completion.
+
+
+ Setup the task temporary output.
+
+
+ Check whether a task needs a commit. This is to avoid the commit
+ procedure if a task does not need commit.
+
+
+ Commit of the task output.
+
+
+ Discard the task commit.
+
+
+
+ @see FileOutputCommitter
+ @see JobContext
+ @see TaskAttemptContext
+ @deprecated Use {@link org.apache.hadoop.mapreduce.OutputCommitter} instead.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This is to validate the output specification for the job when it is
+ a job is submitted. Typically checks that it does not already exist,
+ throwing an exception when it already exists, so that output is not
+ overwritten.
+
+ @param ignored
+ @param job job configuration.
+ @throws IOException when output should not be attempted]]>
+
+
+
+ OutputFormat describes the output-specification for a
+ Map-Reduce job.
+
+
The Map-Reduce framework relies on the OutputFormat of the
+ job to:
+
+
+ Validate the output-specification of the job. For e.g. check that the
+ output directory doesn't already exist.
+
+ Provide the {@link RecordWriter} implementation to be used to write out
+ the output files of the job. Output files are stored in a
+ {@link FileSystem}.
+
+
+
+ @see RecordWriter
+ @see JobConf
+ @deprecated Use {@link org.apache.hadoop.mapreduce.OutputFormat} instead.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Typically a hash function on a all or a subset of the key.
+
+ @param key the key to be paritioned.
+ @param value the entry value.
+ @param numPartitions the total number of partitions.
+ @return the partition number for the key.]]>
+
+
+
+ Partitioner controls the partitioning of the keys of the
+ intermediate map-outputs. The key (or a subset of the key) is used to derive
+ the partition, typically by a hash function. The total number of partitions
+ is the same as the number of reduce tasks for the job. Hence this controls
+ which of the m reduce tasks the intermediate key (and hence the
+ record) is sent for reduction.
+
+ @see Reducer
+ @deprecated Use {@link org.apache.hadoop.mapreduce.Partitioner} instead.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if there exists a key/value,
+ false otherwise.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ RawKeyValueIterator is an iterator used to iterate over
+ the raw keys and values during sort/merge of intermediate data.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ 0.0 to 1.0.
+ @throws IOException]]>
+
+
+
+ RecordReader reads <key, value> pairs from an
+ {@link InputSplit}.
+
+
RecordReader, typically, converts the byte-oriented view of
+ the input, provided by the InputSplit, and presents a
+ record-oriented view for the {@link Mapper} & {@link Reducer} tasks for
+ processing. It thus assumes the responsibility of processing record
+ boundaries and presenting the tasks with keys and values.
RecordWriter implementations write the job outputs to the
+ {@link FileSystem}.
+
+ @see OutputFormat]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Reduces values for a given key.
+
+
The framework calls this method for each
+ <key, (list of values)> pair in the grouped inputs.
+ Output values must be of the same type as input values. Input keys must
+ not be altered. The framework will reuse the key and value objects
+ that are passed into the reduce, therefore the application should clone
+ the objects they want to keep a copy of. In many cases, all values are
+ combined into zero or one value.
+
+
+
Output pairs are collected with calls to
+ {@link OutputCollector#collect(Object,Object)}.
+
+
Applications can use the {@link Reporter} provided to report progress
+ or just indicate that they are alive. In scenarios where the application
+ takes an insignificant amount of time to process individual key/value
+ pairs, this is crucial since the framework might assume that the task has
+ timed-out and kill that task. The other way of avoiding this is to set
+
+ mapred.task.timeout to a high-enough value (or even zero for no
+ time-outs).
+
+ @param key the key.
+ @param values the list of values to reduce.
+ @param output to collect keys and combined values.
+ @param reporter facility to report progress.]]>
+
+
+
+ The number of Reducers for the job is set by the user via
+ {@link JobConf#setNumReduceTasks(int)}. Reducer implementations
+ can access the {@link JobConf} for the job via the
+ {@link JobConfigurable#configure(JobConf)} method and initialize themselves.
+ Similarly they can use the {@link Closeable#close()} method for
+ de-initialization.
+
+
Reducer has 3 primary phases:
+
+
+
+
Shuffle
+
+
Reducer is input the grouped output of a {@link Mapper}.
+ In the phase the framework, for each Reducer, fetches the
+ relevant partition of the output of all the Mappers, via HTTP.
+
+
+
+
+
Sort
+
+
The framework groups Reducer inputs by keys
+ (since different Mappers may have output the same key) in this
+ stage.
+
+
The shuffle and sort phases occur simultaneously i.e. while outputs are
+ being fetched they are merged.
+
+
SecondarySort
+
+
If equivalence rules for keys while grouping the intermediates are
+ different from those for grouping keys before reduction, then one may
+ specify a Comparator via
+ {@link JobConf#setOutputValueGroupingComparator(Class)}.Since
+ {@link JobConf#setOutputKeyComparatorClass(Class)} can be used to
+ control how intermediate keys are grouped, these can be used in conjunction
+ to simulate secondary sort on values.
+
+
+ For example, say that you want to find duplicate web pages and tag them
+ all with the url of the "best" known example. You would set up the job
+ like:
+
+
Map Input Key: url
+
Map Input Value: document
+
Map Output Key: document checksum, url pagerank
+
Map Output Value: url
+
Partitioner: by checksum
+
OutputKeyComparator: by checksum and then decreasing pagerank
+
OutputValueGroupingComparator: by checksum
+
+
+
+
+
Reduce
+
+
In this phase the
+ {@link #reduce(Object, Iterator, OutputCollector, Reporter)}
+ method is called for each <key, (list of values)> pair in
+ the grouped inputs.
+
The output of the reduce task is typically written to the
+ {@link FileSystem} via
+ {@link OutputCollector#collect(Object, Object)}.
+
+
+
+
The output of the Reducer is not re-sorted.
+
+
Example:
+
+ public class MyReducer<K extends WritableComparable, V extends Writable>
+ extends MapReduceBase implements Reducer<K, V, K, V> {
+
+ static enum MyCounters { NUM_RECORDS }
+
+ private String reduceTaskId;
+ private int noKeys = 0;
+
+ public void configure(JobConf job) {
+ reduceTaskId = job.get("mapred.task.id");
+ }
+
+ public void reduce(K key, Iterator<V> values,
+ OutputCollector<K, V> output,
+ Reporter reporter)
+ throws IOException {
+
+ // Process
+ int noValues = 0;
+ while (values.hasNext()) {
+ V value = values.next();
+
+ // Increment the no. of values for this key
+ ++noValues;
+
+ // Process the <key, value> pair (assume this takes a while)
+ // ...
+ // ...
+
+ // Let the framework know that we are alive, and kicking!
+ if ((noValues%10) == 0) {
+ reporter.progress();
+ }
+
+ // Process some more
+ // ...
+ // ...
+
+ // Output the <key, value>
+ output.collect(key, value);
+ }
+
+ // Increment the no. of <key, list of values> pairs processed
+ ++noKeys;
+
+ // Increment counters
+ reporter.incrCounter(NUM_RECORDS, 1);
+
+ // Every 100 keys update application-level status
+ if ((noKeys%100) == 0) {
+ reporter.setStatus(reduceTaskId + " processed " + noKeys);
+ }
+ }
+ }
+
+
+ @see Mapper
+ @see Partitioner
+ @see Reporter
+ @see MapReduceBase
+ @deprecated Use {@link org.apache.hadoop.mapreduce.Reducer} instead.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Counter of the given group/name.]]>
+
+
+
+
+
+
+ Counter of the given group/name.]]>
+
+
+
+
+
+
+ Enum.
+ @param amount A non-negative amount by which the counter is to
+ be incremented.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ InputSplit that the map is reading from.
+ @throws UnsupportedOperationException if called outside a mapper]]>
+
+
+
+
+
+
+
+
+ {@link Mapper} and {@link Reducer} can use the Reporter
+ provided to report progress or just indicate that they are alive. In
+ scenarios where the application takes an insignificant amount of time to
+ process individual key/value pairs, this is crucial since the framework
+ might assume that the task has timed-out and kill that task.
+
+
Applications can also update {@link Counters} via the provided
+ Reporter .
+
+ @see Progressable
+ @see Counters]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ progress of the job's map-tasks, as a float between 0.0
+ and 1.0. When all map tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's map-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ progress of the job's reduce-tasks, as a float between 0.0
+ and 1.0. When all reduce tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's reduce-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ progress of the job's cleanup-tasks, as a float between 0.0
+ and 1.0. When all cleanup tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's cleanup-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ progress of the job's setup-tasks, as a float between 0.0
+ and 1.0. When all setup tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's setup-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ true if the job is complete, else false.
+ @throws IOException]]>
+
+
+
+
+
+ true if the job succeeded, else false.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ RunningJob is the user-interface to query for details on a
+ running Map-Reduce job.
+
+
Clients can get hold of RunningJob via the {@link JobClient}
+ and then query the running-job for details such as name, configuration,
+ progress etc.
+
+ @see JobClient]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This allows the user to specify the key class to be different
+ from the actual class ({@link BytesWritable}) used for writing
+
+ @param conf the {@link JobConf} to modify
+ @param theClass the SequenceFile output key class.]]>
+
+
+
+
+
+
+ This allows the user to specify the value class to be different
+ from the actual class ({@link BytesWritable}) used for writing
+
+ @param conf the {@link JobConf} to modify
+ @param theClass the SequenceFile output key class.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ f. The filtering criteria is
+ MD5(key) % f == 0.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ f using
+ the criteria record# % f == 0.
+ For example, if the frequency is 10, one out of 10 records is returned.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if auto increment
+ {@link SkipBadRecords#COUNTER_MAP_PROCESSED_RECORDS}.
+ false otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ true if auto increment
+ {@link SkipBadRecords#COUNTER_REDUCE_PROCESSED_GROUPS}.
+ false otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Hadoop provides an optional mode of execution in which the bad records
+ are detected and skipped in further attempts.
+
+
This feature can be used when map/reduce tasks crashes deterministically on
+ certain input. This happens due to bugs in the map/reduce function. The usual
+ course would be to fix these bugs. But sometimes this is not possible;
+ perhaps the bug is in third party libraries for which the source code is
+ not available. Due to this, the task never reaches to completion even with
+ multiple attempts and complete data for that task is lost.
+
+
With this feature, only a small portion of data is lost surrounding
+ the bad record, which may be acceptable for some user applications.
+ see {@link SkipBadRecords#setMapperMaxSkipRecords(Configuration, long)}
+
+
The skipping mode gets kicked off after certain no of failures
+ see {@link SkipBadRecords#setAttemptsToStartSkipping(Configuration, int)}
+
+
In the skipping mode, the map/reduce task maintains the record range which
+ is getting processed at all times. Before giving the input to the
+ map/reduce function, it sends this record range to the Task tracker.
+ If task crashes, the Task tracker knows which one was the last reported
+ range. On further attempts that range get skipped.
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ all task attempt IDs
+ of any jobtracker, in any job, of the first
+ map task, we would use :
+
+ @param jtIdentifier jobTracker identifier, or null
+ @param jobId job number, or null
+ @param isMap whether the tip is a map, or null
+ @param taskId taskId number, or null
+ @param attemptId the task attempt number, or null
+ @return a regex pattern matching TaskAttemptIDs]]>
+
+
+
+
+ An example TaskAttemptID is :
+ attempt_200707121733_0003_m_000005_0 , which represents the
+ zeroth task attempt for the fifth map task in the third job
+ running at the jobtracker started at 200707121733.
+
+ Applications should never construct or parse TaskAttemptID strings
+ , but rather use appropriate constructors or {@link #forName(String)}
+ method.
+
+ @see JobID
+ @see TaskID]]>
+
+ @param jtIdentifier jobTracker identifier, or null
+ @param jobId job number, or null
+ @param isMap whether the tip is a map, or null
+ @param taskId taskId number, or null
+ @return a regex pattern matching TaskIDs]]>
+
+
+
+
+
+
+
+
+ An example TaskID is :
+ task_200707121733_0003_m_000005 , which represents the
+ fifth map task in the third job running at the jobtracker
+ started at 200707121733.
+
+ Applications should never construct or parse TaskID strings
+ , but rather use appropriate constructors or {@link #forName(String)}
+ method.
+
+ @see JobID
+ @see TaskAttemptID]]>
+
+
+
+
+
+
+
+ (tbl(,),tbl(,),...,tbl(,)) }]]>
+
+
+
+
+
+
+
+ (tbl(,),tbl(,),...,tbl(,)) }]]>
+
+
+
+ mapred.join.define.<ident> to a classname. In the expression
+ mapred.join.expr, the identifier will be assumed to be a
+ ComposableRecordReader.
+ mapred.join.keycomparator can be a classname used to compare keys
+ in the join.
+ @see JoinRecordReader
+ @see MultiFilterRecordReader]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ......
+ }]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ capacity children to position
+ id in the parent reader.
+ The id of a root CompositeRecordReader is -1 by convention, but relying
+ on this is not recommended.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ override(S1,S2,S3) will prefer values
+ from S3 over S2, and values from S2 over S1 for all keys
+ emitted from all sources.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ [,,...,]]]>
+
+
+
+
+
+
+ out.
+ TupleWritable format:
+ {@code
+ ......
+ }]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ It has to be specified how key and values are passed from one element of
+ the chain to the next, by value or by reference. If a Mapper leverages the
+ assumed semantics that the key and values are not modified by the collector
+ 'by value' must be used. If the Mapper does not expect this semantics, as
+ an optimization to avoid serialization and deserialization 'by reference'
+ can be used.
+
+ For the added Mapper the configuration given for it,
+ mapperConf, have precedence over the job's JobConf. This
+ precedence is in effect when the task is running.
+
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainMapper, this is done by the addMapper for the last mapper in the chain
+
+
+ @param job job's JobConf to add the Mapper class.
+ @param klass the Mapper class to add.
+ @param inputKeyClass mapper input key class.
+ @param inputValueClass mapper input value class.
+ @param outputKeyClass mapper output key class.
+ @param outputValueClass mapper output value class.
+ @param byValue indicates if key/values should be passed by value
+ to the next Mapper in the chain, if any.
+ @param mapperConf a JobConf with the configuration for the Mapper
+ class. It is recommended to use a JobConf without default values using the
+ JobConf(boolean loadDefaults) constructor with FALSE.]]>
+
+
+
+
+
+
+ If this method is overriden super.configure(...) should be
+ invoked at the beginning of the overwriter method.]]>
+
+
+
+
+
+
+
+
+
+ map(...) methods of the Mappers in the chain.]]>
+
+
+
+
+
+
+ If this method is overriden super.close() should be
+ invoked at the end of the overwriter method.]]>
+
+
+
+
+ The Mapper classes are invoked in a chained (or piped) fashion, the output of
+ the first becomes the input of the second, and so on until the last Mapper,
+ the output of the last Mapper will be written to the task's output.
+
+ The key functionality of this feature is that the Mappers in the chain do not
+ need to be aware that they are executed in a chain. This enables having
+ reusable specialized Mappers that can be combined to perform composite
+ operations within a single task.
+
+ Special care has to be taken when creating chains that the key/values output
+ by a Mapper are valid for the following Mapper in the chain. It is assumed
+ all Mappers and the Reduce in the chain use maching output and input key and
+ value classes as no conversion is done by the chaining code.
+
+ Using the ChainMapper and the ChainReducer classes is possible to compose
+ Map/Reduce jobs that look like [MAP+ / REDUCE MAP*]. And
+ immediate benefit of this pattern is a dramatic reduction in disk IO.
+
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainMapper, this is done by the addMapper for the last mapper in the chain.
+
+ ChainMapper usage pattern:
+
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ It has to be specified how key and values are passed from one element of
+ the chain to the next, by value or by reference. If a Reducer leverages the
+ assumed semantics that the key and values are not modified by the collector
+ 'by value' must be used. If the Reducer does not expect this semantics, as
+ an optimization to avoid serialization and deserialization 'by reference'
+ can be used.
+
+ For the added Reducer the configuration given for it,
+ reducerConf, have precedence over the job's JobConf. This
+ precedence is in effect when the task is running.
+
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainReducer, this is done by the setReducer or the addMapper for the last
+ element in the chain.
+
+ @param job job's JobConf to add the Reducer class.
+ @param klass the Reducer class to add.
+ @param inputKeyClass reducer input key class.
+ @param inputValueClass reducer input value class.
+ @param outputKeyClass reducer output key class.
+ @param outputValueClass reducer output value class.
+ @param byValue indicates if key/values should be passed by value
+ to the next Mapper in the chain, if any.
+ @param reducerConf a JobConf with the configuration for the Reducer
+ class. It is recommended to use a JobConf without default values using the
+ JobConf(boolean loadDefaults) constructor with FALSE.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ It has to be specified how key and values are passed from one element of
+ the chain to the next, by value or by reference. If a Mapper leverages the
+ assumed semantics that the key and values are not modified by the collector
+ 'by value' must be used. If the Mapper does not expect this semantics, as
+ an optimization to avoid serialization and deserialization 'by reference'
+ can be used.
+
+ For the added Mapper the configuration given for it,
+ mapperConf, have precedence over the job's JobConf. This
+ precedence is in effect when the task is running.
+
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainMapper, this is done by the addMapper for the last mapper in the chain
+ .
+
+ @param job chain job's JobConf to add the Mapper class.
+ @param klass the Mapper class to add.
+ @param inputKeyClass mapper input key class.
+ @param inputValueClass mapper input value class.
+ @param outputKeyClass mapper output key class.
+ @param outputValueClass mapper output value class.
+ @param byValue indicates if key/values should be passed by value
+ to the next Mapper in the chain, if any.
+ @param mapperConf a JobConf with the configuration for the Mapper
+ class. It is recommended to use a JobConf without default values using the
+ JobConf(boolean loadDefaults) constructor with FALSE.]]>
+
+
+
+
+
+
+ If this method is overriden super.configure(...) should be
+ invoked at the beginning of the overwriter method.]]>
+
+
+
+
+
+
+
+
+
+ reduce(...) method of the Reducer with the
+ map(...) methods of the Mappers in the chain.]]>
+
+
+
+
+
+
+ If this method is overriden super.close() should be
+ invoked at the end of the overwriter method.]]>
+
+
+
+
+ For each record output by the Reducer, the Mapper classes are invoked in a
+ chained (or piped) fashion, the output of the first becomes the input of the
+ second, and so on until the last Mapper, the output of the last Mapper will
+ be written to the task's output.
+
+ The key functionality of this feature is that the Mappers in the chain do not
+ need to be aware that they are executed after the Reducer or in a chain.
+ This enables having reusable specialized Mappers that can be combined to
+ perform composite operations within a single task.
+
+ Special care has to be taken when creating chains that the key/values output
+ by a Mapper are valid for the following Mapper in the chain. It is assumed
+ all Mappers and the Reduce in the chain use maching output and input key and
+ value classes as no conversion is done by the chaining code.
+
+ Using the ChainMapper and the ChainReducer classes is possible to compose
+ Map/Reduce jobs that look like [MAP+ / REDUCE MAP*]. And
+ immediate benefit of this pattern is a dramatic reduction in disk IO.
+
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainReducer, this is done by the setReducer or the addMapper for the last
+ element in the chain.
+
+ ChainReducer usage pattern:
+
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ RecordReader's for CombineFileSplit's.
+ @see CombineFileSplit]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ th Path]]>
+
+
+
+
+
+ th Path]]>
+
+
+
+
+
+
+
+
+
+
+ th Path]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ CombineFileSplit can be used to implement {@link org.apache.hadoop.mapred.RecordReader}'s,
+ with reading one record per file.
+ @see org.apache.hadoop.mapred.FileSplit
+ @see CombineFileInputFormat]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ all splits.
+ @param freq The frequency with which records will be emitted.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ all splits.
+ This will read every split at the client, which is very expensive.
+ @param freq Probability with which a key will be chosen.
+ @param numSamples Total number of samples to obtain from all selected
+ splits.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ all splits.
+ Takes the first numSamples / numSplits records from each split.
+ @param numSamples Total number of samples to obtain from all selected
+ splits.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the name output is multi, false
+ if it is single. If the name output is not defined it returns
+ false]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ @param conf job conf to add the named output
+ @param namedOutput named output name, it has to be a word, letters
+ and numbers only, cannot be the word 'part' as
+ that is reserved for the
+ default output.
+ @param outputFormatClass OutputFormat class.
+ @param keyClass key class
+ @param valueClass value class]]>
+
+
+
+
+
+
+
+
+
+
+
+ @param conf job conf to add the named output
+ @param namedOutput named output name, it has to be a word, letters
+ and numbers only, cannot be the word 'part' as
+ that is reserved for the
+ default output.
+ @param outputFormatClass OutputFormat class.
+ @param keyClass key class
+ @param valueClass value class]]>
+
+
+
+
+
+
+
+ By default these counters are disabled.
+
+ MultipleOutputs supports counters, by default the are disabled.
+ The counters group is the {@link MultipleOutputs} class name.
+
+ The names of the counters are the same as the named outputs. For multi
+ named outputs the name of the counter is the concatenation of the named
+ output, and underscore '_' and the multiname.
+
+ @param conf job conf to enableadd the named output.
+ @param enabled indicates if the counters will be enabled or not.]]>
+
+
+
+
+
+
+ By default these counters are disabled.
+
+ MultipleOutputs supports counters, by default the are disabled.
+ The counters group is the {@link MultipleOutputs} class name.
+
+ The names of the counters are the same as the named outputs. For multi
+ named outputs the name of the counter is the concatenation of the named
+ output, and underscore '_' and the multiname.
+
+
+ @param conf job conf to enableadd the named output.
+ @return TRUE if the counters are enabled, FALSE if they are disabled.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ @param namedOutput the named output name
+ @param reporter the reporter
+ @return the output collector for the given named output
+ @throws IOException thrown if output collector could not be created]]>
+
+
+
+
+
+
+
+
+
+
+ @param namedOutput the named output name
+ @param multiName the multi name part
+ @param reporter the reporter
+ @return the output collector for the given named output
+ @throws IOException thrown if output collector could not be created]]>
+
+
+
+
+
+
+ If overriden subclasses must invoke super.close() at the
+ end of their close()
+
+ @throws java.io.IOException thrown if any of the MultipleOutput files
+ could not be closed properly.]]>
+
+
+
+ OutputCollector passed to
+ the map() and reduce() methods of the
+ Mapper and Reducer implementations.
+
+ Each additional output, or named output, may be configured with its own
+ OutputFormat, with its own key class and with its own value
+ class.
+
+ A named output can be a single file or a multi file. The later is refered as
+ a multi named output.
+
+ A multi named output is an unbound set of files all sharing the same
+ OutputFormat, key class and value class configuration.
+
+ When named outputs are used within a Mapper implementation,
+ key/values written to a name output are not part of the reduce phase, only
+ key/values written to the job OutputCollector are part of the
+ reduce phase.
+
+ MultipleOutputs supports counters, by default the are disabled. The counters
+ group is the {@link MultipleOutputs} class name.
+
+ The names of the counters are the same as the named outputs. For multi
+ named outputs the name of the counter is the concatenation of the named
+ output, and underscore '_' and the multiname.
+
+ Job configuration usage pattern is:
+
+
+ JobConf conf = new JobConf();
+
+ conf.setInputPath(inDir);
+ FileOutputFormat.setOutputPath(conf, outDir);
+
+ conf.setMapperClass(MOMap.class);
+ conf.setReducerClass(MOReduce.class);
+ ...
+
+ // Defines additional single text based output 'text' for the job
+ MultipleOutputs.addNamedOutput(conf, "text", TextOutputFormat.class,
+ LongWritable.class, Text.class);
+
+ // Defines additional multi sequencefile based output 'sequence' for the
+ // job
+ MultipleOutputs.addMultiNamedOutput(conf, "seq",
+ SequenceFileOutputFormat.class,
+ LongWritable.class, Text.class);
+ ...
+
+ JobClient jc = new JobClient();
+ RunningJob job = jc.submitJob(conf);
+
+ ...
+
+
+ Job configuration usage pattern is:
+
+
+ public class MOReduce implements
+ Reducer<WritableComparable, Writable> {
+ private MultipleOutputs mos;
+
+ public void configure(JobConf conf) {
+ ...
+ mos = new MultipleOutputs(conf);
+ }
+
+ public void reduce(WritableComparable key, Iterator<Writable> values,
+ OutputCollector output, Reporter reporter)
+ throws IOException {
+ ...
+ mos.getCollector("text", reporter).collect(key, new Text("Hello"));
+ mos.getCollector("seq", "A", reporter).collect(key, new Text("Bye"));
+ mos.getCollector("seq", "B", reporter).collect(key, new Text("Chau"));
+ ...
+ }
+
+ public void close() throws IOException {
+ mos.close();
+ ...
+ }
+
+ }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ It can be used instead of the default implementation,
+ @link org.apache.hadoop.mapred.MapRunner, when the Map operation is not CPU
+ bound in order to improve throughput.
+
+ Map implementations using this MapRunnable must be thread-safe.
+
+ The Map-Reduce job has to be configured to use this MapRunnable class (using
+ the JobConf.setMapRunnerClass method) and
+ the number of thread the thread-pool can use with the
+ mapred.map.multithreadedrunner.threads property, its default
+ value is 10 threads.
+
+ Alternatively, the properties can be set in the configuration with proper
+ values.
+
+ @see DBConfiguration#configureDB(JobConf, String, String, String, String)
+ @see DBInputFormat#setInput(JobConf, Class, String, String)
+ @see DBInputFormat#setInput(JobConf, Class, String, String, String, String...)
+ @see DBOutputFormat#setOutput(JobConf, String, String...)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ 20070101 AND length > 0)'
+ @param orderBy the fieldNames in the orderBy clause.
+ @param fieldNames The field names in the table
+ @see #setInput(JobConf, Class, String, String)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ DBInputFormat emits LongWritables containing the record number as
+ key and DBWritables as value.
+
+ The SQL query, and input class can be using one of the two
+ setInput methods.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ {@link DBOutputFormat} accepts <key,value> pairs, where
+ key has a type extending DBWritable. Returned {@link RecordWriter}
+ writes only the key to the database with a batch SQL query.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ DBWritable. DBWritable, is similar to {@link Writable}
+ except that the {@link #write(PreparedStatement)} method takes a
+ {@link PreparedStatement}, and {@link #readFields(ResultSet)}
+ takes a {@link ResultSet}.
+
+ Implementations are responsible for writing the fields of the object
+ to PreparedStatement, and reading the fields of the object from the
+ ResultSet.
+
+
Example:
+ If we have the following table in the database :
+
+ CREATE TABLE MyTable (
+ counter INTEGER NOT NULL,
+ timestamp BIGINT NOT NULL,
+ );
+
+ then we can read/write the tuples from/to the table with :
+
+ public class MyWritable implements Writable, DBWritable {
+ // Some data
+ private int counter;
+ private long timestamp;
+
+ //Writable#write() implementation
+ public void write(DataOutput out) throws IOException {
+ out.writeInt(counter);
+ out.writeLong(timestamp);
+ }
+
+ //Writable#readFields() implementation
+ public void readFields(DataInput in) throws IOException {
+ counter = in.readInt();
+ timestamp = in.readLong();
+ }
+
+ public void write(PreparedStatement statement) throws SQLException {
+ statement.setInt(1, counter);
+ statement.setLong(2, timestamp);
+ }
+
+ public void readFields(ResultSet resultSet) throws SQLException {
+ counter = resultSet.getInt(1);
+ timestamp = resultSet.getLong(2);
+ }
+ }
+
Note: The split is a logical split of the inputs and the
+ input files are not physically split into chunks. For e.g. a split could
+ be <input-file-path, start, offset> tuple. The InputFormat
+ also creates the {@link RecordReader} to read the {@link InputSplit}.
+
+ @param context job configuration.
+ @return an array of {@link InputSplit}s for the job.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ InputFormat describes the input-specification for a
+ Map-Reduce job.
+
+
The Map-Reduce framework relies on the InputFormat of the
+ job to:
+
+
+ Validate the input-specification of the job.
+
+ Split-up the input file(s) into logical {@link InputSplit}s, each of
+ which is then assigned to an individual {@link Mapper}.
+
+
+ Provide the {@link RecordReader} implementation to be used to glean
+ input records from the logical InputSplit for processing by
+ the {@link Mapper}.
+
+
+
+
The default behavior of file-based {@link InputFormat}s, typically
+ sub-classes of {@link FileInputFormat}, is to split the
+ input into logical {@link InputSplit}s based on the total size, in
+ bytes, of the input files. However, the {@link FileSystem} blocksize of
+ the input files is treated as an upper bound for input splits. A lower bound
+ on the split size can be set via
+
+ mapred.min.split.size.
+
+
Clearly, logical splits based on input-size is insufficient for many
+ applications since record boundaries are to respected. In such cases, the
+ application has to also implement a {@link RecordReader} on whom lies the
+ responsibility to respect record-boundaries and present a record-oriented
+ view of the logical InputSplit to the individual task.
+
+ @see InputSplit
+ @see RecordReader
+ @see FileInputFormat]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ InputSplit represents the data to be processed by an
+ individual {@link Mapper}.
+
+
Typically, it presents a byte-oriented view on the input and is the
+ responsibility of {@link RecordReader} of the job to process this and present
+ a record-oriented view.
+
+ @see InputFormat
+ @see RecordReader]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ InputFormat to use
+ @throws IllegalStateException if the job is submitted]]>
+
+
+
+
+
+
+ OutputFormat to use
+ @throws IllegalStateException if the job is submitted]]>
+
+
+
+
+
+
+ Mapper to use
+ @throws IllegalStateException if the job is submitted]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Reducer to use
+ @throws IllegalStateException if the job is submitted]]>
+
+
+
+
+
+
+ Partitioner to use
+ @throws IllegalStateException if the job is submitted]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ progress of the job's map-tasks, as a float between 0.0
+ and 1.0. When all map tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's map-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ progress of the job's reduce-tasks, as a float between 0.0
+ and 1.0. When all reduce tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's reduce-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ true if the job is complete, else false.
+ @throws IOException]]>
+
+
+
+
+
+ true if the job succeeded, else false.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ JobTracker is lost]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ 1.
+ @return the number of reduce tasks for this job.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ An example JobID is :
+ job_200707121733_0003 , which represents the third job
+ running at the jobtracker started at 200707121733.
+
+ Applications should never construct or parse JobID strings, but rather
+ use appropriate constructors or {@link #forName(String)} method.
+
+ @see TaskID
+ @see TaskAttemptID
+ @see org.apache.hadoop.mapred.JobTracker#getNewJobId()
+ @see org.apache.hadoop.mapred.JobTracker#getStartTime()]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the key input type to the Mapper
+ @param the value input type to the Mapper
+ @param the key output type from the Mapper
+ @param the value output type from the Mapper]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Maps are the individual tasks which transform input records into a
+ intermediate records. The transformed intermediate records need not be of
+ the same type as the input records. A given input pair may map to zero or
+ many output pairs.
+
+
The Hadoop Map-Reduce framework spawns one map task for each
+ {@link InputSplit} generated by the {@link InputFormat} for the job.
+ Mapper implementations can access the {@link Configuration} for
+ the job via the {@link JobContext#getConfiguration()}.
+
+
The framework first calls
+ {@link #setup(org.apache.hadoop.mapreduce.Mapper.Context)}, followed by
+ {@link #map(Object, Object, Context)}
+ for each key/value pair in the InputSplit. Finally
+ {@link #cleanup(Context)} is called.
+
+
All intermediate values associated with a given output key are
+ subsequently grouped by the framework, and passed to a {@link Reducer} to
+ determine the final output. Users can control the sorting and grouping by
+ specifying two key {@link RawComparator} classes.
+
+
The Mapper outputs are partitioned per
+ Reducer. Users can control which keys (and hence records) go to
+ which Reducer by implementing a custom {@link Partitioner}.
+
+
Users can optionally specify a combiner, via
+ {@link Job#setCombinerClass(Class)}, to perform local aggregation of the
+ intermediate outputs, which helps to cut down the amount of data transferred
+ from the Mapper to the Reducer.
+
+
Applications can specify if and how the intermediate
+ outputs are to be compressed and which {@link CompressionCodec}s are to be
+ used via the Configuration.
+
+
If the job has zero
+ reduces then the output of the Mapper is directly written
+ to the {@link OutputFormat} without sorting by keys.
+
+
Example:
+
+ public class TokenCounterMapper
+ extends Mapper
+
+
Applications may override the {@link #run(Context)} method to exert
+ greater control on map processing e.g. multi-threaded Mappers
+ etc.
The Map-Reduce framework relies on the OutputCommitter of
+ the job to:
+
+
+ Setup the job during initialization. For example, create the temporary
+ output directory for the job during the initialization of the job.
+
+
+ Cleanup the job after the job completion. For example, remove the
+ temporary output directory after the job completion.
+
+
+ Setup the task temporary output.
+
+
+ Check whether a task needs a commit. This is to avoid the commit
+ procedure if a task does not need commit.
+
+
+ Commit of the task output.
+
+
+ Discard the task commit.
+
+
+
+ @see org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
+ @see JobContext
+ @see TaskAttemptContext]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This is to validate the output specification for the job when it is
+ a job is submitted. Typically checks that it does not already exist,
+ throwing an exception when it already exists, so that output is not
+ overwritten.
+
+ @param context information about the job
+ @throws IOException when output should not be attempted]]>
+
+
+
+
+
+
+
+
+
+
+
+ OutputFormat describes the output-specification for a
+ Map-Reduce job.
+
+
The Map-Reduce framework relies on the OutputFormat of the
+ job to:
+
+
+ Validate the output-specification of the job. For e.g. check that the
+ output directory doesn't already exist.
+
+ Provide the {@link RecordWriter} implementation to be used to write out
+ the output files of the job. Output files are stored in a
+ {@link FileSystem}.
+
+
+
+ @see RecordWriter]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ Typically a hash function on a all or a subset of the key.
+
+ @param key the key to be partioned.
+ @param value the entry value.
+ @param numPartitions the total number of partitions.
+ @return the partition number for the key.]]>
+
+
+
+ Partitioner controls the partitioning of the keys of the
+ intermediate map-outputs. The key (or a subset of the key) is used to derive
+ the partition, typically by a hash function. The total number of partitions
+ is the same as the number of reduce tasks for the job. Hence this controls
+ which of the m reduce tasks the intermediate key (and hence the
+ record) is sent for reduction.
+
+ @see Reducer]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ @param ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ RecordWriter to future operations.
+
+ @param context the context of the task
+ @throws IOException]]>
+
+
+
+ RecordWriter writes the output <key, value> pairs
+ to an output file.
+
+
RecordWriter implementations write the job outputs to the
+ {@link FileSystem}.
+
+ @see OutputFormat]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the class of the input keys
+ @param the class of the input values
+ @param the class of the output keys
+ @param the class of the output values]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Reducer implementations
+ can access the {@link Configuration} for the job via the
+ {@link JobContext#getConfiguration()} method.
+
+
Reducer has 3 primary phases:
+
+
+
+
Shuffle
+
+
The Reducer copies the sorted output from each
+ {@link Mapper} using HTTP across the network.
+
+
+
+
Sort
+
+
The framework merge sorts Reducer inputs by
+ keys
+ (since different Mappers may have output the same key).
+
+
The shuffle and sort phases occur simultaneously i.e. while outputs are
+ being fetched they are merged.
+
+
SecondarySort
+
+
To achieve a secondary sort on the values returned by the value
+ iterator, the application should extend the key with the secondary
+ key and define a grouping comparator. The keys will be sorted using the
+ entire key, but will be grouped using the grouping comparator to decide
+ which keys and values are sent in the same call to reduce.The grouping
+ comparator is specified via
+ {@link Job#setGroupingComparatorClass(Class)}. The sort order is
+ controlled by
+ {@link Job#setSortComparatorClass(Class)}.
+
+
+ For example, say that you want to find duplicate web pages and tag them
+ all with the url of the "best" known example. You would set up the job
+ like:
+
+
Map Input Key: url
+
Map Input Value: document
+
Map Output Key: document checksum, url pagerank
+
Map Output Value: url
+
Partitioner: by checksum
+
OutputKeyComparator: by checksum and then decreasing pagerank
+
OutputValueGroupingComparator: by checksum
+
+
+
+
+
Reduce
+
+
In this phase the
+ {@link #reduce(Object, Iterable, Context)}
+ method is called for each <key, (collection of values)> in
+ the sorted inputs.
+
The output of the reduce task is typically written to a
+ {@link RecordWriter} via
+ {@link Context#write(Object, Object)}.
+
+
+
+
The output of the Reducer is not re-sorted.
+
+
Example:
+
+ public class IntSumReducer extends Reducer {
+ private IntWritable result = new IntWritable();
+
+ public void reduce(Key key, Iterable values,
+ Context context) throws IOException {
+ int sum = 0;
+ for (IntWritable val : values) {
+ sum += val.get();
+ }
+ result.set(sum);
+ context.collect(key, result);
+ }
+ }
+
+ Applications should never construct or parse TaskAttemptID strings
+ , but rather use appropriate constructors or {@link #forName(String)}
+ method.
+
+ @see JobID
+ @see TaskID]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ An example TaskID is :
+ task_200707121733_0003_m_000005 , which represents the
+ fifth map task in the third job running at the jobtracker
+ started at 200707121733.
+
+ Applications should never construct or parse TaskID strings
+ , but rather use appropriate constructors or {@link #forName(String)}
+ method.
+
+ @see JobID
+ @see TaskAttemptID]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the input key type for the task
+ @param the input value type for the task
+ @param the output key type for the task
+ @param the output value type for the task]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FileInputFormat implementations can override this and return
+ false to ensure that individual input files are never split-up
+ so that {@link Mapper}s process entire files.
+
+ @param context the job context
+ @param filename the file name to check
+ @return is this file splitable?]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FileInputFormat is the base class for all file-based
+ InputFormats. This provides a generic implementation of
+ {@link #getSplits(JobContext)}.
+ Subclasses of FileInputFormat can also override the
+ {@link #isSplitable(JobContext, Path)} method to ensure input-files are
+ not split-up and are processed as a whole by {@link Mapper}s.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the map's input key type
+ @param the map's input value type
+ @param the map's output key type
+ @param the map's output value type
+ @param job the job
+ @return the mapper class to run]]>
+
+
+
+
+
+
+ the map input key type
+ @param the map input value type
+ @param the map output key type
+ @param the map output value type
+ @param job the job to modify
+ @param cls the class to use as the mapper]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ It can be used instead of the default implementation,
+ @link org.apache.hadoop.mapred.MapRunner, when the Map operation is not CPU
+ bound in order to improve throughput.
+
+ Mapper implementations using this MapRunnable must be thread-safe.
+
+ The Map-Reduce job has to be configured with the mapper to use via
+ {@link #setMapperClass(Configuration, Class)} and
+ the number of thread the thread-pool can use with the
+ {@link #getNumberOfThreads(Configuration) method. The default
+ value is 10 threads.
+
Some applications need to create/write-to side-files, which differ from
+ the actual job-outputs.
+
+
In such cases there could be issues with 2 instances of the same TIP
+ (running simultaneously e.g. speculative tasks) trying to open/write-to the
+ same file (path) on HDFS. Hence the application-writer will have to pick
+ unique names per task-attempt (e.g. using the attemptid, say
+ attempt_200709221812_0001_m_000000_0), not just per TIP.
+
+
To get around this the Map-Reduce framework helps the application-writer
+ out by maintaining a special
+ ${mapred.output.dir}/_temporary/_${taskid}
+ sub-directory for each task-attempt on HDFS where the output of the
+ task-attempt goes. On successful completion of the task-attempt the files
+ in the ${mapred.output.dir}/_temporary/_${taskid} (only)
+ are promoted to ${mapred.output.dir}. Of course, the
+ framework discards the sub-directory of unsuccessful task-attempts. This
+ is completely transparent to the application.
+
+
The application-writer can take advantage of this by creating any
+ side-files required in a work directory during execution
+ of his task i.e. via
+ {@link #getWorkOutputPath(TaskInputOutputContext)}, and
+ the framework will move them out similarly - thus she doesn't have to pick
+ unique paths per task-attempt.
+
+
The entire discussion holds true for maps of jobs with
+ reducer=NONE (i.e. 0 reduces) since output of the map, in that case,
+ goes directly to HDFS.
+
+ @return the {@link Path} to the task's temporary output directory
+ for the map-reduce job.]]>
+
+
+
+
+
+
+
+
+
+ The path can be used to create custom files from within the map and
+ reduce tasks. The path name will be unique for each task. The path parent
+ will be the job output directory.ls
+
+
This method uses the {@link #getUniqueFile} method to make the file name
+ unique for the task.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/x86_64/share/hadoop/common/jdiff/hadoop_0.20.1.xml b/x86_64/share/hadoop/common/jdiff/hadoop_0.20.1.xml
new file mode 100644
index 0000000..fc05639
--- /dev/null
+++ b/x86_64/share/hadoop/common/jdiff/hadoop_0.20.1.xml
@@ -0,0 +1,53832 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ final.
+
+ @param name resource to be added, the classpath is examined for a file
+ with that name.]]>
+
+
+
+
+
+ final.
+
+ @param url url of the resource to be added, the local filesystem is
+ examined directly to find the resource, without referring to
+ the classpath.]]>
+
+
+
+
+
+ final.
+
+ @param file file-path of resource to be added, the local filesystem is
+ examined directly to find the resource, without referring to
+ the classpath.]]>
+
+
+
+
+
+ final.
+
+ @param in InputStream to deserialize the object from.]]>
+
+
+
+
+
+
+
+
+
+
+ name property, null if
+ no such property exists.
+
+ Values are processed for variable expansion
+ before being returned.
+
+ @param name the property name.
+ @return the value of the name property,
+ or null if no such property exists.]]>
+
+
+
+
+
+ name property, without doing
+ variable expansion.
+
+ @param name the property name.
+ @return the value of the name property,
+ or null if no such property exists.]]>
+
+
+
+
+
+
+ value of the name property.
+
+ @param name property name.
+ @param value property value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ name property. If no such property
+ exists, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value, or defaultValue if the property
+ doesn't exist.]]>
+
+
+
+
+
+
+ name property as an int.
+
+ If no such property exists, or if the specified value is not a valid
+ int, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as an int,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to an int.
+
+ @param name property name.
+ @param value int value of the property.]]>
+
+
+
+
+
+
+ name property as a long.
+ If no such property is specified, or if the specified value is not a valid
+ long, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as a long,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to a long.
+
+ @param name property name.
+ @param value long value of the property.]]>
+
+
+
+
+
+
+ name property as a float.
+ If no such property is specified, or if the specified value is not a valid
+ float, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as a float,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to a float.
+
+ @param name property name.
+ @param value property value.]]>
+
+
+
+
+
+
+ name property as a boolean.
+ If no such property is specified, or if the specified value is not a valid
+ boolean, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as a boolean,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to a boolean.
+
+ @param name property name.
+ @param value boolean value of the property.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ name property as
+ a collection of Strings.
+ If no such property is specified then empty collection is returned.
+
+ This is an optimized version of {@link #getStrings(String)}
+
+ @param name property name.
+ @return property value as a collection of Strings.]]>
+
+
+
+
+
+ name property as
+ an array of Strings.
+ If no such property is specified then null is returned.
+
+ @param name property name.
+ @return property value as an array of Strings,
+ or null.]]>
+
+
+
+
+
+
+ name property as
+ an array of Strings.
+ If no such property is specified then default value is returned.
+
+ @param name property name.
+ @param defaultValue The default value
+ @return property value as an array of Strings,
+ or default value.]]>
+
+
+
+
+
+
+ name property as
+ as comma delimited values.
+
+ @param name property name.
+ @param values The values]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ name property
+ as an array of Class.
+ The value of the property specifies a list of comma separated class names.
+ If no such property is specified, then defaultValue is
+ returned.
+
+ @param name the property name.
+ @param defaultValue default value.
+ @return property value as a Class[],
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property as a Class.
+ If no such property is specified, then defaultValue is
+ returned.
+
+ @param name the class name.
+ @param defaultValue default value.
+ @return property value as a Class,
+ or defaultValue.]]>
+
+
+
+
+
+
+
+ name property as a Class
+ implementing the interface specified by xface.
+
+ If no such property is specified, then defaultValue is
+ returned.
+
+ An exception is thrown if the returned class does not implement the named
+ interface.
+
+ @param name the class name.
+ @param defaultValue default value.
+ @param xface the interface implemented by the named class.
+ @return property value as a Class,
+ or defaultValue.]]>
+
+
+
+
+
+
+
+ name property to the name of a
+ theClass implementing the given interface xface.
+
+ An exception is thrown if theClass does not implement the
+ interface xface.
+
+ @param name property name.
+ @param theClass property value.
+ @param xface the interface implemented by the named class.]]>
+
+
+
+
+
+
+
+ dirsProp with
+ the given path. If dirsProp contains multiple directories,
+ then one is chosen based on path's hash code. If the selected
+ directory does not exist, an attempt is made to create it.
+
+ @param dirsProp directory in which to locate the file.
+ @param path file-path.
+ @return local file under the directory with the given path.]]>
+
+
+
+
+
+
+
+ dirsProp with
+ the given path. If dirsProp contains multiple directories,
+ then one is chosen based on path's hash code. If the selected
+ directory does not exist, an attempt is made to create it.
+
+ @param dirsProp directory in which to locate the file.
+ @param path file-path.
+ @return local file under the directory with the given path.]]>
+
+
+
+
+
+
+
+
+
+
+
+ name.
+
+ @param name configuration resource name.
+ @return an input stream attached to the resource.]]>
+
+
+
+
+
+ name.
+
+ @param name configuration resource name.
+ @return a reader attached to the resource.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ String
+ key-value pairs in the configuration.
+
+ @return an iterator over the entries.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true to set quiet-mode on, false
+ to turn it off.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Resources
+
+
Configurations are specified by resources. A resource contains a set of
+ name/value pairs as XML data. Each resource is named by either a
+ String or by a {@link Path}. If named by a String,
+ then the classpath is examined for a file with that name. If named by a
+ Path, then the local filesystem is examined directly, without
+ referring to the classpath.
+
+
Unless explicitly turned off, Hadoop by default specifies two
+ resources, loaded in-order from the classpath:
core-site.xml: Site-specific configuration for a given hadoop
+ installation.
+
+ Applications may add additional resources, which are loaded
+ subsequent to these resources in the order they are added.
+
+
Final Parameters
+
+
Configuration parameters may be declared final.
+ Once a resource declares a value final, no subsequently-loaded
+ resource can alter that value.
+ For example, one might define a final parameter with:
+
+
+ When conf.get("tempdir") is called, then ${basedir}
+ will be resolved to another property in this Configuration, while
+ ${user.name} would then ordinarily be resolved to the value
+ of the System property with that name.]]>
+
Applications specify the files, via urls (hdfs:// or http://) to be cached
+ via the {@link org.apache.hadoop.mapred.JobConf}.
+ The DistributedCache assumes that the
+ files specified via hdfs:// urls are already present on the
+ {@link FileSystem} at the path specified by the url.
+
+
The framework will copy the necessary files on to the slave node before
+ any tasks for the job are executed on that node. Its efficiency stems from
+ the fact that the files are only copied once per job and the ability to
+ cache archives which are un-archived on the slaves.
+
+
DistributedCache can be used to distribute simple, read-only
+ data/text files and/or more complex types such as archives, jars etc.
+ Archives (zip, tar and tgz/tar.gz files) are un-archived at the slave nodes.
+ Jars may be optionally added to the classpath of the tasks, a rudimentary
+ software distribution mechanism. Files have execution permissions.
+ Optionally users can also direct it to symlink the distributed cache file(s)
+ into the working directory of the task.
+
+
DistributedCache tracks modification timestamps of the cache
+ files. Clearly the cache files should not be modified by the application
+ or externally while the job is executing.
+
+
Here is an illustrative example on how to use the
+ DistributedCache:
+
+ // Setting up the cache for the application
+
+ 1. Copy the requisite files to the FileSystem:
+
+ $ bin/hadoop fs -copyFromLocal lookup.dat /myapp/lookup.dat
+ $ bin/hadoop fs -copyFromLocal map.zip /myapp/map.zip
+ $ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar
+ $ bin/hadoop fs -copyFromLocal mytar.tar /myapp/mytar.tar
+ $ bin/hadoop fs -copyFromLocal mytgz.tgz /myapp/mytgz.tgz
+ $ bin/hadoop fs -copyFromLocal mytargz.tar.gz /myapp/mytargz.tar.gz
+
+ 2. Setup the application's JobConf:
+
+ JobConf job = new JobConf();
+ DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"),
+ job);
+ DistributedCache.addCacheArchive(new URI("/myapp/map.zip", job);
+ DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
+ DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar", job);
+ DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz", job);
+ DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz", job);
+
+ 3. Use the cached files in the {@link org.apache.hadoop.mapred.Mapper}
+ or {@link org.apache.hadoop.mapred.Reducer}:
+
+ public static class MapClass extends MapReduceBase
+ implements Mapper<K, V, K, V> {
+
+ private Path[] localArchives;
+ private Path[] localFiles;
+
+ public void configure(JobConf job) {
+ // Get the cached archives/files
+ localArchives = DistributedCache.getLocalCacheArchives(job);
+ localFiles = DistributedCache.getLocalCacheFiles(job);
+ }
+
+ public void map(K key, V value,
+ OutputCollector<K, V> output, Reporter reporter)
+ throws IOException {
+ // Use data from the cached archives/files here
+ // ...
+ // ...
+ output.collect(k, v);
+ }
+ }
+
+
+ A filename pattern is composed of regular characters and
+ special pattern matching characters, which are:
+
+
+
+
+
+
?
+
Matches any single character.
+
+
+
*
+
Matches zero or more characters.
+
+
+
[abc]
+
Matches a single character from character set
+ {a,b,c}.
+
+
+
[a-b]
+
Matches a single character from the character range
+ {a...b}. Note that character a must be
+ lexicographically less than or equal to character b.
+
+
+
[^a]
+
Matches a single character that is not from character set or range
+ {a}. Note that the ^ character must occur
+ immediately to the right of the opening bracket.
+
+
+
\c
+
Removes (escapes) any special meaning of character c.
+
+
+
{ab,cd}
+
Matches a string from the string set {ab, cd}
+
+
+
{ab,c{de,fh}}
+
Matches a string from the string set {ab, cde, cfh}
+
+
+
+
+
+ @param pathPattern a regular expression specifying a pth pattern
+
+ @return an array of paths that match the path pattern
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ All user code that may potentially use the Hadoop Distributed
+ File System should be written to use a FileSystem object. The
+ Hadoop DFS is a multi-machine system that appears as a single
+ disk. It's useful because of its fault tolerance and potentially
+ very large capacity.
+
+
+ The local implementation is {@link LocalFileSystem} and distributed
+ implementation is DistributedFileSystem.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FilterFileSystem contains
+ some other file system, which it uses as
+ its basic file system, possibly transforming
+ the data along the way or providing additional
+ functionality. The class FilterFileSystem
+ itself simply overrides all methods of
+ FileSystem with versions that
+ pass all requests to the contained file
+ system. Subclasses of FilterFileSystem
+ may further override some of these methods
+ and may also provide additional methods
+ and fields.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ buf at offset
+ and checksum into checksum.
+ The method is used for implementing read, therefore, it should be optimized
+ for sequential reading
+ @param pos chunkPos
+ @param buf desitination buffer
+ @param offset offset in buf at which to store data
+ @param len maximun number of bytes to read
+ @return number of bytes read]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ -1 if the end of the
+ stream is reached.
+ @exception IOException if an I/O error occurs.]]>
+
+
+
+
+
+
+
+
+ This method implements the general contract of the corresponding
+ {@link InputStream#read(byte[], int, int) read} method of
+ the {@link InputStream} class. As an additional
+ convenience, it attempts to read as many bytes as possible by repeatedly
+ invoking the read method of the underlying stream. This
+ iterated read continues until one of the following
+ conditions becomes true:
+
+
The specified number of bytes have been read,
+
+
The read method of the underlying stream returns
+ -1, indicating end-of-file.
+
+
If the first read on the underlying stream returns
+ -1 to indicate end-of-file then this method returns
+ -1. Otherwise this method returns the number of bytes
+ actually read.
+
+ @param b destination buffer.
+ @param off offset at which to start storing bytes.
+ @param len maximum number of bytes to read.
+ @return the number of bytes read, or -1 if the end of
+ the stream has been reached.
+ @exception IOException if an I/O error occurs.
+ ChecksumException if any checksum error occurs]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ n bytes of data from the
+ input stream.
+
+
This method may skip more bytes than are remaining in the backing
+ file. This produces no exception and the number of bytes skipped
+ may include some number of bytes that were beyond the EOF of the
+ backing file. Attempting to read from the stream after skipping past
+ the end will result in -1 indicating the end of the file.
+
+
If n is negative, no bytes are skipped.
+
+ @param n the number of bytes to be skipped.
+ @return the actual number of bytes skipped.
+ @exception IOException if an I/O error occurs.
+ ChecksumException if the chunk to skip to is corrupted]]>
+
+
+
+
+
+
+ This method may seek past the end of the file.
+ This produces no exception and an attempt to read from
+ the stream will result in -1 indicating the end of the file.
+
+ @param pos the postion to seek to.
+ @exception IOException if an I/O error occurs.
+ ChecksumException if the chunk to seek to is corrupted]]>
+
+
+
+
+
+
+
+
+
+ len bytes from
+ stm
+
+ @param stm an input stream
+ @param buf destiniation buffer
+ @param offset offset at which to store data
+ @param len number of bytes to read
+ @return actual number of bytes read
+ @throws IOException if there is any IO error]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ len bytes from the specified byte array
+ starting at offset off and generate a checksum for
+ each data chunk.
+
+
This method stores bytes from the given array into this
+ stream's buffer before it gets checksumed. The buffer gets checksumed
+ and flushed to the underlying output stream when all data
+ in a checksum chunk are in the buffer. If the buffer is empty and
+ requested length is at least as large as the size of next checksum chunk
+ size, this method will checksum and write the chunk directly
+ to the underlying output stream. Thus it avoids uneccessary data copy.
+
+ @param b the data.
+ @param off the start offset in the data.
+ @param len the number of bytes to write.
+ @exception IOException if an I/O error occurs.]]>
+
+
+ DataInputBuffer buffer = new DataInputBuffer();
+ while (... loop condition ...) {
+ byte[] data = ... get data ...;
+ int dataLength = ... get data length ...;
+ buffer.reset(data, dataLength);
+ ... read buffer using DataInput methods ...
+ }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This saves memory over creating a new DataOutputStream and
+ ByteArrayOutputStream each time data is written.
+
+
Typical usage is something like the following:
+
+ DataOutputBuffer buffer = new DataOutputBuffer();
+ while (... loop condition ...) {
+ buffer.reset();
+ ... write buffer using DataOutput methods ...
+ byte[] data = buffer.getData();
+ int dataLength = buffer.getLength();
+ ... write data to its ultimate destination ...
+ }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to store
+ @param item the object to be stored
+ @param keyName the name of the key to use
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param keyName the name of the key to use
+ @param itemClass the class of the item
+ @return restored object
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param items the objects to be stored
+ @param keyName the name of the key to use
+ @throws IndexOutOfBoundsException if the items array is empty
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param keyName the name of the key to use
+ @param itemClass the class of the item
+ @return restored object
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+ DefaultStringifier offers convenience methods to store/load objects to/from
+ the configuration.
+
+ @param the class of the objects to stringify]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a DoubleWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a FloatWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ When two sequence files, which have same Key type but different Value
+ types, are mapped out to reduce, multiple Value types is not allowed.
+ In this case, this class can help you wrap instances with different types.
+
+
+
+ Compared with ObjectWritable, this class is much more effective,
+ because ObjectWritable will append the class declaration as a String
+ into the output file in every Key-Value pair.
+
+
+
+ Generic Writable implements {@link Configurable} interface, so that it will be
+ configured by the framework. The configuration is passed to the wrapped objects
+ implementing {@link Configurable} interface before deserialization.
+
+
+ how to use it:
+ 1. Write your own class, such as GenericObject, which extends GenericWritable.
+ 2. Implements the abstract method getTypes(), defines
+ the classes which will be wrapped in GenericObject in application.
+ Attention: this classes defined in getTypes() method, must
+ implement Writable interface.
+
+
+ @since Nov 8, 2006]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This saves memory over creating a new InputStream and
+ ByteArrayInputStream each time data is read.
+
+
Typical usage is something like the following:
+
+ InputBuffer buffer = new InputBuffer();
+ while (... loop condition ...) {
+ byte[] data = ... get data ...;
+ int dataLength = ... get data length ...;
+ buffer.reset(data, dataLength);
+ ... read buffer using InputStream methods ...
+ }
+
+ @see DataInputBuffer
+ @see DataOutput]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a IntWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ closes the input and output streams
+ at the end.
+ @param in InputStrem to read from
+ @param out OutputStream to write to
+ @param conf the Configuration object]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ignore any {@link IOException} or
+ null pointers. Must only be used for cleanup in exception handlers.
+ @param log the log to record problems to at debug level. Can be null.
+ @param closeables the objects to close]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a LongWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A map is a directory containing two files, the data file,
+ containing all keys and values in the map, and a smaller index
+ file, containing a fraction of the keys. The fraction is determined by
+ {@link Writer#getIndexInterval()}.
+
+
The index file is read entirely into memory. Thus key implementations
+ should try to keep themselves small.
+
+
Map files are created by adding entries in-order. To maintain a large
+ database, perform updates by copying the previous version of a database and
+ merging in a sorted change list, to create a new version of the database in
+ a new file. Sorting large change lists can be done with {@link
+ SequenceFile.Sorter}.]]>
+
SequenceFile provides {@link Writer}, {@link Reader} and
+ {@link Sorter} classes for writing, reading and sorting respectively.
+
+ There are three SequenceFileWriters based on the
+ {@link CompressionType} used to compress key/value pairs:
+
+
+ Writer : Uncompressed records.
+
+
+ RecordCompressWriter : Record-compressed files, only compress
+ values.
+
+
+ BlockCompressWriter : Block-compressed files, both keys &
+ values are collected in 'blocks'
+ separately and compressed. The size of
+ the 'block' is configurable.
+
+
+
The actual compression algorithm used to compress key and/or values can be
+ specified by using the appropriate {@link CompressionCodec}.
+
+
The recommended way is to use the static createWriter methods
+ provided by the SequenceFile to chose the preferred format.
+
+
The {@link Reader} acts as the bridge and can read any of the above
+ SequenceFile formats.
+
+
SequenceFile Formats
+
+
Essentially there are 3 different formats for SequenceFiles
+ depending on the CompressionType specified. All of them share a
+ common header described below.
+
+
SequenceFile Header
+
+
+ version - 3 bytes of magic header SEQ, followed by 1 byte of actual
+ version number (e.g. SEQ4 or SEQ6)
+
+
+ keyClassName -key class
+
+
+ valueClassName - value class
+
+
+ compression - A boolean which specifies if compression is turned on for
+ keys/values in this file.
+
+
+ blockCompression - A boolean which specifies if block-compression is
+ turned on for keys/values in this file.
+
+
+ compression codec - CompressionCodec class which is used for
+ compression of keys and/or values (if compression is
+ enabled).
+
+
+ metadata - {@link Metadata} for this file.
+
+
+ sync - A sync marker to denote end of the header.
+
The compressed blocks of key lengths and value lengths consist of the
+ actual lengths of individual keys/values encoded in ZeroCompressedInteger
+ format.
+
+ @see CompressionCodec]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ key, skipping its
+ value. True if another entry exists, and false at end of file.]]>
+
+
+
+
+
+
+
+ key and
+ val. Returns true if such a pair exists and false when at
+ end of file]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The position passed must be a position returned by {@link
+ SequenceFile.Writer#getLength()} when writing this file. To seek to an arbitrary
+ position, use {@link SequenceFile.Reader#sync(long)}.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ SegmentDescriptor
+ @param segments the list of SegmentDescriptors
+ @param tmpDir the directory to write temporary files into
+ @return RawKeyValueIterator
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ For best performance, applications should make sure that the {@link
+ Writable#readFields(DataInput)} implementation of their keys is
+ very efficient. In particular, it should avoid allocating memory.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This always returns a synchronized position. In other words,
+ immediately after calling {@link SequenceFile.Reader#seek(long)} with a position
+ returned by this method, {@link SequenceFile.Reader#next(Writable)} may be called. However
+ the key may be earlier in the file than key last written when this
+ method was called (e.g., with block-compression, it may be the first key
+ in the block that was being written when this method was called).]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ key. Returns
+ true if such a key exists and false when at the end of the set.]]>
+
+
+
+
+
+
+ key.
+ Returns key, or null if no match exists.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the class of the objects to stringify]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ position. Note that this
+ method avoids using the converter or doing String instatiation
+ @return the Unicode scalar value at position or -1
+ if the position is invalid or points to a
+ trailing byte]]>
+
+
+
+
+
+
+
+
+
+ what in the backing
+ buffer, starting as position start. The starting
+ position is measured in bytes and the return value is in
+ terms of byte position in the buffer. The backing buffer is
+ not converted to a string for this operation.
+ @return byte position of the first occurence of the search
+ string in the UTF-8 buffer or -1 if not found]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a Text with the same contents.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ replace is true, then
+ malformed input is replaced with the
+ substitution character, which is U+FFFD. Otherwise the
+ method throws a MalformedInputException.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ replace is true, then
+ malformed input is replaced with the
+ substitution character, which is U+FFFD. Otherwise the
+ method throws a MalformedInputException.
+ @return ByteBuffer: bytes stores at ByteBuffer.array()
+ and length is ByteBuffer.limit()]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ In
+ addition, it provides methods for string traversal without converting the
+ byte array to a string.
Also includes utilities for
+ serializing/deserialing a string, coding/decoding a string, checking if a
+ byte array contains valid UTF8 code, calculating the length of an encoded
+ string.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a UTF8 with the same contents.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Also includes utilities for efficiently reading and writing UTF-8.
+
+ @deprecated replaced by Text]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This is useful when a class may evolve, so that instances written by the
+ old version of the class may still be processed by the new version. To
+ handle this situation, {@link #readFields(DataInput)}
+ implementations should catch {@link VersionMismatchException}.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a VIntWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a VLongWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ out.
+
+ @param out DataOuput to serialize this object into.
+ @throws IOException]]>
+
+
+
+
+
+
+ in.
+
+
For efficiency, implementations should attempt to re-use storage in the
+ existing object where possible.
+
+ @param in DataInput to deseriablize this object from.
+ @throws IOException]]>
+
+
+
+ Any key or value type in the Hadoop Map-Reduce
+ framework implements this interface.
+
+
Implementations typically implement a static read(DataInput)
+ method which constructs a new instance, calls {@link #readFields(DataInput)}
+ and returns the instance.
+
+
Example:
+
+ public class MyWritable implements Writable {
+ // Some data
+ private int counter;
+ private long timestamp;
+
+ public void write(DataOutput out) throws IOException {
+ out.writeInt(counter);
+ out.writeLong(timestamp);
+ }
+
+ public void readFields(DataInput in) throws IOException {
+ counter = in.readInt();
+ timestamp = in.readLong();
+ }
+
+ public static MyWritable read(DataInput in) throws IOException {
+ MyWritable w = new MyWritable();
+ w.readFields(in);
+ return w;
+ }
+ }
+
]]>
+
+
+
+
+
+
+
+
+ WritableComparables can be compared to each other, typically
+ via Comparators. Any type which is to be used as a
+ key in the Hadoop Map-Reduce framework should implement this
+ interface.
+
+
Example:
+
+ public class MyWritableComparable implements WritableComparable {
+ // Some data
+ private int counter;
+ private long timestamp;
+
+ public void write(DataOutput out) throws IOException {
+ out.writeInt(counter);
+ out.writeLong(timestamp);
+ }
+
+ public void readFields(DataInput in) throws IOException {
+ counter = in.readInt();
+ timestamp = in.readLong();
+ }
+
+ public int compareTo(MyWritableComparable w) {
+ int thisValue = this.value;
+ int thatValue = ((IntWritable)o).value;
+ return (thisValue < thatValue ? -1 : (thisValue==thatValue ? 0 : 1));
+ }
+ }
+
One may optimize compare-intensive operations by overriding
+ {@link #compare(byte[],int,int,byte[],int,int)}. Static utility methods are
+ provided to assist in optimized implementations of this method.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Enum type
+ @param in DataInput to read from
+ @param enumType Class type of Enum
+ @return Enum represented by String read from DataInput
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ len number of bytes in input streamin
+ @param in input stream
+ @param len number of bytes to skip
+ @throws IOException when skipped less number of bytes]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ CompressionCodec for which to get the
+ Compressor
+ @return Compressor for the given
+ CompressionCodec from the pool or a new one]]>
+
+
+
+
+
+ CompressionCodec for which to get the
+ Decompressor
+ @return Decompressor for the given
+ CompressionCodec the pool or a new one]]>
+
+
+
+
+
+ Compressor to be returned to the pool]]>
+
+
+
+
+
+ Decompressor to be returned to the
+ pool]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Implementations are assumed to be buffered. This permits clients to
+ reposition the underlying input stream then call {@link #resetState()},
+ without having to also synchronize client buffers.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true indicating that more input data is required.
+
+ @param b Input data
+ @param off Start offset
+ @param len Length]]>
+
+
+
+
+ true if the input data buffer is empty and
+ #setInput() should be called in order to provide more input.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the end of the compressed
+ data output stream has been reached.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true indicating that more input data is required.
+
+ @param b Input data
+ @param off Start offset
+ @param len Length]]>
+
+
+
+
+ true if the input data buffer is empty and
+ #setInput() should be called in order to provide more input.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ true if a preset dictionary is needed for decompression.
+ @return true if a preset dictionary is needed for decompression]]>
+
+
+
+
+ true if the end of the compressed
+ data output stream has been reached.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FIXME: This array should be in a private or package private location,
+ since it could be modified by malicious code.
+ ]]>
+
+
+
+
+ This interface is public for historical purposes. You should have no need to
+ use it.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Although BZip2 headers are marked with the magic "Bz" this
+ constructor expects the next byte in the stream to be the first one after
+ the magic. Thus callers have to skip the first two bytes. Otherwise this
+ constructor will throw an exception.
+
+
+ @throws IOException
+ if the stream content is malformed or an I/O error occurs.
+ @throws NullPointerException
+ if in == null]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The decompression requires large amounts of memory. Thus you should call the
+ {@link #close() close()} method as soon as possible, to force
+ CBZip2InputStream to release the allocated memory. See
+ {@link CBZip2OutputStream CBZip2OutputStream} for information about memory
+ usage.
+
+
+
+ CBZip2InputStream reads bytes from the compressed source stream via
+ the single byte {@link java.io.InputStream#read() read()} method exclusively.
+ Thus you should consider to use a buffered source stream.
+
+
+
+ Instances of this class are not threadsafe.
+
]]>
+
+
+
+
+
+
+
+
+
+ CBZip2OutputStream with a blocksize of 900k.
+
+
+ Attention: The caller is resonsible to write the two BZip2 magic
+ bytes "BZ" to the specified stream prior to calling this
+ constructor.
+
+
+ @param out *
+ the destination stream.
+
+ @throws IOException
+ if an I/O error occurs in the specified stream.
+ @throws NullPointerException
+ if out == null.]]>
+
+
+
+
+
+ CBZip2OutputStream with specified blocksize.
+
+
+ Attention: The caller is resonsible to write the two BZip2 magic
+ bytes "BZ" to the specified stream prior to calling this
+ constructor.
+
+
+
+ @param out
+ the destination stream.
+ @param blockSize
+ the blockSize as 100k units.
+
+ @throws IOException
+ if an I/O error occurs in the specified stream.
+ @throws IllegalArgumentException
+ if (blockSize < 1) || (blockSize > 9).
+ @throws NullPointerException
+ if out == null.
+
+ @see #MIN_BLOCKSIZE
+ @see #MAX_BLOCKSIZE]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ inputLength this method returns MAX_BLOCKSIZE
+ always.
+
+ @param inputLength
+ The length of the data which will be compressed by
+ CBZip2OutputStream.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ == 1.]]>
+
+
+
+
+ == 9.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ If you are ever unlucky/improbable enough to get a stack overflow whilst
+ sorting, increase the following constant and try again. In practice I
+ have never seen the stack go above 27 elems, so the following limit seems
+ very generous.
+ ]]>
+
+
+
+
+ The compression requires large amounts of memory. Thus you should call the
+ {@link #close() close()} method as soon as possible, to force
+ CBZip2OutputStream to release the allocated memory.
+
+
+
+ You can shrink the amount of allocated memory and maybe raise the compression
+ speed by choosing a lower blocksize, which in turn may cause a lower
+ compression ratio. You can avoid unnecessary memory allocation by avoiding
+ using a blocksize which is bigger than the size of the input.
+
+
+
+ You can compute the memory usage for compressing by the following formula:
+
+
+
+ <code>400k + (9 * blocksize)</code>.
+
+
+
+ To get the memory required for decompression by {@link CBZip2InputStream
+ CBZip2InputStream} use
+
+
+
+ <code>65k + (5 * blocksize)</code>.
+
+
+
+
+
+
+
Memory usage by blocksize
+
+
+
Blocksize
Compression
+ memory usage
Decompression
+ memory usage
+
+
+
100k
+
1300k
+
565k
+
+
+
200k
+
2200k
+
1065k
+
+
+
300k
+
3100k
+
1565k
+
+
+
400k
+
4000k
+
2065k
+
+
+
500k
+
4900k
+
2565k
+
+
+
600k
+
5800k
+
3065k
+
+
+
700k
+
6700k
+
3565k
+
+
+
800k
+
7600k
+
4065k
+
+
+
900k
+
8500k
+
4565k
+
+
+
+
+ For decompression CBZip2InputStream allocates less memory if the
+ bzipped input is smaller than one block.
+
Seek by key or by file offset.
+
+ The memory footprint of a TFile includes the following:
+
+
Some constant overhead of reading or writing a compressed block.
+
+
Each compressed block requires one compression/decompression codec for
+ I/O.
+
Temporary space to buffer the key.
+
Temporary space to buffer the value (for TFile.Writer only). Values are
+ chunk encoded, so that we buffer at most one chunk of user data. By default,
+ the chunk buffer is 1MB. Reading chunked value does not require additional
+ memory.
+
+
TFile index, which is proportional to the total number of Data Blocks.
+ The total amount of memory needed to hold the index can be estimated as
+ (56+AvgKeySize)*NumBlocks.
+
MetaBlock index, which is proportional to the total number of Meta
+ Blocks.The total amount of memory needed to hold the index for Meta Blocks
+ can be estimated as (40+AvgMetaBlockName)*NumMetaBlock.
+
+
+ The behavior of TFile can be customized by the following variables through
+ Configuration:
+
+
tfile.io.chunk.size: Value chunk size. Integer (in bytes). Default
+ to 1MB. Values of the length less than the chunk size is guaranteed to have
+ known value length in read time (See
+ {@link TFile.Reader.Scanner.Entry#isValueLengthKnown()}).
+
tfile.fs.output.buffer.size: Buffer size used for
+ FSDataOutputStream. Integer (in bytes). Default to 256KB.
+
tfile.fs.input.buffer.size: Buffer size used for
+ FSDataInputStream. Integer (in bytes). Default to 256KB.
+
+
+ Suggestions on performance optimization.
+
+
Minimum block size. We recommend a setting of minimum block size between
+ 256KB to 1MB for general usage. Larger block size is preferred if files are
+ primarily for sequential access. However, it would lead to inefficient random
+ access (because there are more data to decompress). Smaller blocks are good
+ for random access, but require more memory to hold the block index, and may
+ be slower to create (because we must flush the compressor stream at the
+ conclusion of each data block, which leads to an FS I/O flush). Further, due
+ to the internal caching in Compression codec, the smallest possible block
+ size would be around 20KB-30KB.
+
The current implementation does not offer true multi-threading for
+ reading. The implementation uses FSDataInputStream seek()+read(), which is
+ shown to be much faster than positioned-read call in single thread mode.
+ However, it also means that if multiple threads attempt to access the same
+ TFile (using multiple scanners) simultaneously, the actual I/O is carried out
+ sequentially even if they access different DFS blocks.
+
Compression codec. Use "none" if the data is not very compressable (by
+ compressable, I mean a compression ratio at least 2:1). Generally, use "lzo"
+ as the starting point for experimenting. "gz" overs slightly better
+ compression ratio over "lzo" but requires 4x CPU to compress and 2x CPU to
+ decompress, comparing to "lzo".
+
File system buffering, if the underlying FSDataInputStream and
+ FSDataOutputStream is already adequately buffered; or if applications
+ reads/writes keys and values in large buffers, we can reduce the sizes of
+ input/output buffering in TFile layer by setting the configuration parameters
+ "tfile.fs.input.buffer.size" and "tfile.fs.output.buffer.size".
+
+
+ Some design rationale behind TFile can be found at Hadoop-3315.]]>
+
+ Use {@link Scanner#advance()} to move the cursor to the next key-value
+ pair (or end if none exists). Use seekTo methods (
+ {@link Scanner#seekTo(byte[])} or
+ {@link Scanner#seekTo(byte[], int, int)}) to seek to any arbitrary
+ location in the covered range (including backward seeking). Use
+ {@link Scanner#rewind()} to seek back to the beginning of the scanner.
+ Use {@link Scanner#seekToEnd()} to seek to the end of the scanner.
+
+ Actual keys and values may be obtained through {@link Scanner.Entry}
+ object, which is obtained through {@link Scanner#entry()}.]]>
+
Algorithmic comparator: binary comparators that is language
+ independent. Currently, only "memcmp" is supported.
+
Language-specific comparator: binary comparators that can
+ only be constructed in specific language. For Java, the syntax
+ is "jclass:", followed by the class name of the RawComparator.
+ Currently, we only support RawComparators that can be
+ constructed through the default constructor (with no
+ parameters). Parameterized RawComparators such as
+ {@link WritableComparator} or
+ {@link JavaSerializationComparator} may not be directly used.
+ One should write a wrapper class that inherits from such classes
+ and use its default constructor to perform proper
+ initialization.
+
+ @param conf
+ The configuration object.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ If an exception is thrown, the TFile will be in an inconsistent
+ state. The only legitimate call after that would be close]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Utils#writeVLong(out, n).
+
+ @param out
+ output stream
+ @param n
+ The integer to be encoded
+ @throws IOException
+ @see Utils#writeVLong(DataOutput, long)]]>
+
+
+
+
+
+
+
+
+
if n in [-32, 127): encode in one byte with the actual value.
+ Otherwise,
+
if n in [-20*2^8, 20*2^8): encode in two bytes: byte[0] = n/256 - 52;
+ byte[1]=n&0xff. Otherwise,
+
if n IN [-16*2^16, 16*2^16): encode in three bytes: byte[0]=n/2^16 -
+ 88; byte[1]=(n>>8)&0xff; byte[2]=n&0xff. Otherwise,
+
if n in [-8*2^24, 8*2^24): encode in four bytes: byte[0]=n/2^24 - 112;
+ byte[1] = (n>>16)&0xff; byte[2] = (n>>8)&0xff; byte[3]=n&0xff. Otherwise:
+
if n in [-2^31, 2^31): encode in five bytes: byte[0]=-125; byte[1] =
+ (n>>24)&0xff; byte[2]=(n>>16)&0xff; byte[3]=(n>>8)&0xff; byte[4]=n&0xff;
+
if n in [-2^39, 2^39): encode in six bytes: byte[0]=-124; byte[1] =
+ (n>>32)&0xff; byte[2]=(n>>24)&0xff; byte[3]=(n>>16)&0xff;
+ byte[4]=(n>>8)&0xff; byte[5]=n&0xff
+
if n in [-2^47, 2^47): encode in seven bytes: byte[0]=-123; byte[1] =
+ (n>>40)&0xff; byte[2]=(n>>32)&0xff; byte[3]=(n>>24)&0xff;
+ byte[4]=(n>>16)&0xff; byte[5]=(n>>8)&0xff; byte[6]=n&0xff;
+
if n in [-2^55, 2^55): encode in eight bytes: byte[0]=-122; byte[1] =
+ (n>>48)&0xff; byte[2] = (n>>40)&0xff; byte[3]=(n>>32)&0xff;
+ byte[4]=(n>>24)&0xff; byte[5]=(n>>16)&0xff; byte[6]=(n>>8)&0xff;
+ byte[7]=n&0xff;
+
if n in [-2^63, 2^63): encode in nine bytes: byte[0]=-121; byte[1] =
+ (n>>54)&0xff; byte[2] = (n>>48)&0xff; byte[3] = (n>>40)&0xff;
+ byte[4]=(n>>32)&0xff; byte[5]=(n>>24)&0xff; byte[6]=(n>>16)&0xff;
+ byte[7]=(n>>8)&0xff; byte[8]=n&0xff;
+
+
+ @param out
+ output stream
+ @param n
+ the integer number
+ @throws IOException]]>
+
if (FB in [-72, -33]), return (FB+52)<<8 + NB[0]&0xff;
+
if (FB in [-104, -73]), return (FB+88)<<16 + (NB[0]&0xff)<<8 +
+ NB[1]&0xff;
+
if (FB in [-120, -105]), return (FB+112)<<24 + (NB[0]&0xff)<<16 +
+ (NB[1]&0xff)<<8 + NB[2]&0xff;
+
if (FB in [-128, -121]), return interpret NB[FB+129] as a signed
+ big-endian integer.
+
+ @param in
+ input stream
+ @return the decoded long integer.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Type of the input key.
+ @param list
+ The list
+ @param key
+ The input key.
+ @param cmp
+ Comparator for the key.
+ @return The index to the desired element if it exists; or list.size()
+ otherwise.]]>
+
+
+
+
+
+
+
+
+ Type of the input key.
+ @param list
+ The list
+ @param key
+ The input key.
+ @param cmp
+ Comparator for the key.
+ @return The index to the desired element if it exists; or list.size()
+ otherwise.]]>
+
+
+
+
+
+
+
+ Type of the input key.
+ @param list
+ The list
+ @param key
+ The input key.
+ @return The index to the desired element if it exists; or list.size()
+ otherwise.]]>
+
+
+
+
+
+
+
+ Type of the input key.
+ @param list
+ The list
+ @param key
+ The input key.
+ @return The index to the desired element if it exists; or list.size()
+ otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Keep trying a limited number of times, waiting a fixed time between attempts,
+ and then fail by re-throwing the exception.
+ ]]>
+
+
+
+
+
+
+
+
+ Keep trying for a maximum time, waiting a fixed time between attempts,
+ and then fail by re-throwing the exception.
+ ]]>
+
+
+
+
+
+
+
+
+ Keep trying a limited number of times, waiting a growing amount of time between attempts,
+ and then fail by re-throwing the exception.
+ The time between attempts is sleepTime mutliplied by the number of tries so far.
+ ]]>
+
+
+
+
+
+
+
+
+ Keep trying a limited number of times, waiting a growing amount of time between attempts,
+ and then fail by re-throwing the exception.
+ The time between attempts is sleepTime mutliplied by a random
+ number in the range of [0, 2 to the number of retries)
+ ]]>
+
+
+
+
+
+
+
+ Set a default policy with some explicit handlers for specific exceptions.
+ ]]>
+
+
+
+
+
+
+
+ A retry policy for RemoteException
+ Set a default policy with some explicit handlers for specific exceptions.
+ ]]>
+
+
+
+
+
+ Try once, and fail by re-throwing the exception.
+ This corresponds to having no retry mechanism in place.
+ ]]>
+
+
+
+
+
+ Try once, and fail silently for void methods, or by
+ re-throwing the exception for non-void methods.
+ ]]>
+
+
+
+
+
+ Keep trying forever.
+ ]]>
+
+
+
+
+ A collection of useful implementations of {@link RetryPolicy}.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+ Determines whether the framework should retry a
+ method for the given exception, and the number
+ of retries that have been made for that operation
+ so far.
+
+ @param e The exception that caused the method to fail.
+ @param retries The number of times the method has been retried.
+ @return true if the method should be retried,
+ false if the method should not be retried
+ but shouldn't fail with an exception (only for void methods).
+ @throws Exception The re-thrown exception e indicating
+ that the method failed and should not be retried further.]]>
+
+
+
+
+ Specifies a policy for retrying method failures.
+ Implementations of this interface should be immutable.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Create a proxy for an interface of an implementation class
+ using the same retry policy for each method in the interface.
+
+ @param iface the interface that the retry will implement
+ @param implementation the instance whose methods should be retried
+ @param retryPolicy the policy for retirying method call failures
+ @return the retry proxy]]>
+
+
+
+
+
+
+
+
+ Create a proxy for an interface of an implementation class
+ using the a set of retry policies specified by method name.
+ If no retry policy is defined for a method then a default of
+ {@link RetryPolicies#TRY_ONCE_THEN_FAIL} is used.
+
+ @param iface the interface that the retry will implement
+ @param implementation the instance whose methods should be retried
+ @param methodNameToPolicyMap a map of method names to retry policies
+ @return the retry proxy]]>
+
+
+
+
+ A factory for creating retry proxies.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+ Prepare the deserializer for reading.]]>
+
+
+
+
+
+
+
+ Deserialize the next object from the underlying input stream.
+ If the object t is non-null then this deserializer
+ may set its internal state to the next object read from the input
+ stream. Otherwise, if the object t is null a new
+ deserialized object will be created.
+
+ @return the deserialized object]]>
+
+
+
+
+
+ Close the underlying input stream and clear up any resources.]]>
+
+
+
+
+ Provides a facility for deserializing objects of type from an
+ {@link InputStream}.
+
+
+
+ Deserializers are stateful, but must not buffer the input since
+ other producers may read from the input between calls to
+ {@link #deserialize(Object)}.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A {@link RawComparator} that uses a {@link Deserializer} to deserialize
+ the objects to be compared so that the standard {@link Comparator} can
+ be used to compare them.
+
+
+ One may optimize compare-intensive operations by using a custom
+ implementation of {@link RawComparator} that operates directly
+ on byte representations.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ An experimental {@link Serialization} for Java {@link Serializable} classes.
+
+ @see JavaSerializationComparator]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A {@link RawComparator} that uses a {@link JavaSerialization}
+ {@link Deserializer} to deserialize objects that are then compared via
+ their {@link Comparable} interfaces.
+
+ @param
+ @see JavaSerialization]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Encapsulates a {@link Serializer}/{@link Deserializer} pair.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+ Serializations are found by reading the io.serializations
+ property from conf, which is a comma-delimited list of
+ classnames.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A factory for {@link Serialization}s.
+ ]]>
+
+
+
+
+
+
+
+
+
+ Prepare the serializer for writing.]]>
+
+
+
+
+
+
+ Serialize t to the underlying output stream.]]>
+
+
+
+
+
+ Close the underlying output stream and clear up any resources.]]>
+
+
+
+
+ Provides a facility for serializing objects of type to an
+ {@link OutputStream}.
+
+
+
+ Serializers are stateful, but must not buffer the output since
+ other producers may write to the output between calls to
+ {@link #serialize(Object)}.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ param, to the IPC server running at
+ address, returning the value. Throws exceptions if there are
+ network problems or if the remote code threw an exception.
+ @deprecated Use {@link #call(Writable, InetSocketAddress, Class, UserGroupInformation)} instead]]>
+
+
+
+
+
+
+
+
+
+ param, to the IPC server running at
+ address with the ticket credentials, returning
+ the value.
+ Throws exceptions if there are network problems or if the remote code
+ threw an exception.
+ @deprecated Use {@link #call(Writable, InetSocketAddress, Class, UserGroupInformation)} instead]]>
+
+
+
+
+
+
+
+
+
+
+ param, to the IPC server running at
+ address which is servicing the protocol protocol,
+ with the ticket credentials, returning the value.
+ Throws exceptions if there are network problems or if the remote code
+ threw an exception.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Unwraps any IOException.
+
+ @param lookupTypes the desired exception class.
+ @return IOException, which is either the lookupClass exception or this.]]>
+
+
+
+
+ This unwraps any Throwable that has a constructor taking
+ a String as a parameter.
+ Otherwise it returns this.
+
+ @return Throwable]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ protocol is a Java interface. All parameters and return types must
+ be one of:
+
+
a primitive type, boolean, byte,
+ char, short, int, long,
+ float, double, or void; or
+
+
a {@link String}; or
+
+
a {@link Writable}; or
+
+
an array of the above types
+
+ All methods in the protocol should throw only IOException. No field data of
+ the protocol instance is transmitted.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ handlerCount determines
+ the number of handler threads that will be used to process calls.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ,name=RpcActivityForPort"
+
+ Many of the activity metrics are sampled and averaged on an interval
+ which can be specified in the metrics config file.
+
+ For the metrics that are sampled and averaged, one must specify
+ a metrics context that does periodic update calls. Most metrics contexts do.
+ The default Null metrics context however does NOT. So if you aren't
+ using any other metrics context then you can turn on the viewing and averaging
+ of sampled metrics by specifying the following two lines
+ in the hadoop-meterics.properties file:
+
+ Note that the metrics are collected regardless of the context used.
+ The context with the update thread is used to average the data periodically
+
+
+
+ Impl details: We use a dynamic mbean that gets the list of the metrics
+ from the metrics registry passed as an argument to the constructor]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class has a number of metrics variables that are publicly accessible;
+ these variables (objects) have methods to update their values;
+ for example:
+
{@link #rpcQueueTime}.inc(time)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ For the statistics that are sampled and averaged, one must specify
+ a metrics context that does periodic update calls. Most do.
+ The default Null metrics context however does NOT. So if you aren't
+ using any other metrics context then you can turn on the viewing and averaging
+ of sampled metrics by specifying the following two lines
+ in the hadoop-meterics.properties file:
+
+ Note that the metrics are collected regardless of the context used.
+ The context with the update thread is used to average the data periodically]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ When constructing the instance, if the factory property
+ contextName.class exists,
+ its value is taken to be the name of the class to instantiate. Otherwise,
+ the default is to create an instance of
+ org.apache.hadoop.metrics.spi.NullContext, which is a
+ dummy "no-op" context which will cause all metric data to be discarded.
+
+ @param contextName the name of the context
+ @return the named MetricsContext]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ When the instance is constructed, this method checks if the file
+ hadoop-metrics.properties exists on the class path. If it
+ exists, it must be in the format defined by java.util.Properties, and all
+ the properties in the file are set as attributes on the newly created
+ ContextFactory instance.
+
+ @return the singleton ContextFactory instance]]>
+
+
+
+ getFactory() method.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ startMonitoring() again after calling
+ this.
+ @see #close()]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ recordName.
+ Throws an exception if the metrics implementation is configured with a fixed
+ set of record names and recordName is not in that set.
+
+ @param recordName the name of the record
+ @throws MetricsException if recordName conflicts with configuration data]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A record name identifies the kind of data to be reported. For example, a
+ program reporting statistics relating to the disks on a computer might use
+ a record name "diskStats".
+
+ A record has zero or more tags. A tag has a name and a value. To
+ continue the example, the "diskStats" record might use a tag named
+ "diskName" to identify a particular disk. Sometimes it is useful to have
+ more than one tag, so there might also be a "diskType" with value "ide" or
+ "scsi" or whatever.
+
+ A record also has zero or more metrics. These are the named
+ values that are to be reported to the metrics system. In the "diskStats"
+ example, possible metric names would be "diskPercentFull", "diskPercentBusy",
+ "kbReadPerSecond", etc.
+
+ The general procedure for using a MetricsRecord is to fill in its tag and
+ metric values, and then call update() to pass the record to the
+ client library.
+ Metric data is not immediately sent to the metrics system
+ each time that update() is called.
+ An internal table is maintained, identified by the record name. This
+ table has columns
+ corresponding to the tag and the metric names, and rows
+ corresponding to each unique set of tag values. An update
+ either modifies an existing row in the table, or adds a new row with a set of
+ tag values that are different from all the other rows. Note that if there
+ are no tags, then there can be at most one row in the table.
+
+ Once a row is added to the table, its data will be sent to the metrics system
+ on every timer period, whether or not it has been updated since the previous
+ timer period. If this is inappropriate, for example if metrics were being
+ reported by some transient object in an application, the remove()
+ method can be used to remove the row and thus stop the data from being
+ sent.
+
+ Note that the update() method is atomic. This means that it is
+ safe for different threads to be updating the same metric. More precisely,
+ it is OK for different threads to call update() on MetricsRecord instances
+ with the same set of tag names and tag values. Different threads should
+ not use the same MetricsRecord instance at the same time.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ MetricsContext.registerUpdater().]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ fileName attribute,
+ if specified. Otherwise the data will be written to standard
+ output.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class is configured by setting ContextFactory attributes which in turn
+ are usually configured through a properties file. All the attributes are
+ prefixed by the contextName. For example, the properties file might contain:
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ contextName.tableName. The returned map consists of
+ those attributes with the contextName and tableName stripped off.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ recordName.
+ Throws an exception if the metrics implementation is configured with a fixed
+ set of record names and recordName is not in that set.
+
+ @param recordName the name of the record
+ @throws MetricsException if recordName conflicts with configuration data]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class implements the internal table of metric data, and the timer
+ on which data is to be sent to the metrics system. Subclasses must
+ override the abstract emitRecord method in order to transmit
+ the data. ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ update
+ and remove().]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ hostname or hostname:port. If
+ the specs string is null, defaults to localhost:defaultPort.
+
+ @return a list of InetSocketAddress objects.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ,name="
+ Where the and are the supplied parameters
+
+ @param serviceName
+ @param nameName
+ @param theMbean - the MBean to register
+ @return the named used to register the MBean]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ hadoop.rpc.socket.factory.class.<ClassName>. When no
+ such parameter exists then fall back on the default socket factory as
+ configured by hadoop.rpc.socket.factory.class.default. If
+ this default socket factory is not configured, then fall back on the JVM
+ default socket factory.
+
+ @param conf the configuration
+ @param clazz the class (usually a {@link VersionedProtocol})
+ @return a socket factory]]>
+
+
+
+
+
+ hadoop.rpc.socket.factory.default
+
+ @param conf the configuration
+ @return the default socket factory as specified in the configuration or
+ the JVM default socket factory if the configuration does not
+ contain a default socket factory property.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ :
+ ://:/]]>
+
+
+
+
+
+
+
+ :
+ ://:/]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ From documentation for {@link #getInputStream(Socket, long)}:
+ Returns InputStream for the socket. If the socket has an associated
+ SocketChannel then it returns a
+ {@link SocketInputStream} with the given timeout. If the socket does not
+ have a channel, {@link Socket#getInputStream()} is returned. In the later
+ case, the timeout argument is ignored and the timeout set with
+ {@link Socket#setSoTimeout(int)} applies for reads.
+
+ Any socket created using socket factories returned by {@link #NetUtils},
+ must use this interface instead of {@link Socket#getInputStream()}.
+
+ @see #getInputStream(Socket, long)
+
+ @param socket
+ @return InputStream for reading from the socket.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+ Any socket created using socket factories returned by {@link #NetUtils},
+ must use this interface instead of {@link Socket#getInputStream()}.
+
+ @see Socket#getChannel()
+
+ @param socket
+ @param timeout timeout in milliseconds. This may not always apply. zero
+ for waiting as long as necessary.
+ @return InputStream for reading from the socket.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+ From documentation for {@link #getOutputStream(Socket, long)} :
+ Returns OutputStream for the socket. If the socket has an associated
+ SocketChannel then it returns a
+ {@link SocketOutputStream} with the given timeout. If the socket does not
+ have a channel, {@link Socket#getOutputStream()} is returned. In the later
+ case, the timeout argument is ignored and the write will wait until
+ data is available.
+
+ Any socket created using socket factories returned by {@link #NetUtils},
+ must use this interface instead of {@link Socket#getOutputStream()}.
+
+ @see #getOutputStream(Socket, long)
+
+ @param socket
+ @return OutputStream for writing to the socket.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+ Any socket created using socket factories returned by {@link #NetUtils},
+ must use this interface instead of {@link Socket#getOutputStream()}.
+
+ @see Socket#getChannel()
+
+ @param socket
+ @param timeout timeout in milliseconds. This may not always apply. zero
+ for waiting as long as necessary.
+ @return OutputStream for writing to the socket.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+ socket.connect(endpoint, timeout). If
+ socket.getChannel() returns a non-null channel,
+ connect is implemented using Hadoop's selectors. This is done mainly
+ to avoid Sun's connect implementation from creating thread-local
+ selectors, since Hadoop does not have control on when these are closed
+ and could end up taking all the available file descriptors.
+
+ @see java.net.Socket#connect(java.net.SocketAddress, int)
+
+ @param socket
+ @param endpoint
+ @param timeout - timeout in milliseconds]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ node
+
+ @param node
+ a node
+ @return true if node is already in the tree; false otherwise]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ scope
+ if scope starts with ~, choose one from the all nodes except for the
+ ones in scope; otherwise, choose one from scope
+ @param scope range of nodes from which a node will be choosen
+ @return the choosen node]]>
+
+
+
+
+
+
+ scope but not in excludedNodes
+ if scope starts with ~, return the number of nodes that are not
+ in scope and excludedNodes;
+ @param scope a path string that may start with ~
+ @param excludedNodes a list of nodes
+ @return number of available nodes]]>
+
+
+
+
+
+
+
+
+
+
+
+ reader
+ It linearly scans the array, if a local node is found, swap it with
+ the first element of the array.
+ If a local rack node is found, swap it with the first element following
+ the local node.
+ If neither local node or local rack node is found, put a random replica
+ location at postion 0.
+ It leaves the rest nodes untouched.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Create a new input stream with the given timeout. If the timeout
+ is zero, it will be treated as infinite timeout. The socket's
+ channel will be configured to be non-blocking.
+
+ @see SocketInputStream#SocketInputStream(ReadableByteChannel, long)
+
+ @param socket should have a channel associated with it.
+ @param timeout timeout timeout in milliseconds. must not be negative.
+ @throws IOException]]>
+
+
+
+
+
+
+
+ Create a new input stream with the given timeout. If the timeout
+ is zero, it will be treated as infinite timeout. The socket's
+ channel will be configured to be non-blocking.
+ @see SocketInputStream#SocketInputStream(ReadableByteChannel, long)
+
+ @param socket should have a channel associated with it.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Create a new ouput stream with the given timeout. If the timeout
+ is zero, it will be treated as infinite timeout. The socket's
+ channel will be configured to be non-blocking.
+
+ @see SocketOutputStream#SocketOutputStream(WritableByteChannel, long)
+
+ @param socket should have a channel associated with it.
+ @param timeout timeout timeout in milliseconds. must not be negative.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ = getCount().
+ @param newCapacity The new capacity in bytes.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Index idx = startVector(...);
+ while (!idx.done()) {
+ .... // read element of a vector
+ idx.incr();
+ }
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This task takes the given record definition files and compiles them into
+ java or c++
+ files. It is then up to the user to compile the generated files.
+
+
The task requires the file or the nested fileset element to be
+ specified. Optional attributes are language (set the output
+ language, default is "java"),
+ destdir (name of the destination directory for generated java/c++
+ code, default is ".") and failonerror (specifies error handling
+ behavior. default is true).
+
]]>
+
+
+
+
+
+
+
+
+
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ (cause==null ? null : cause.toString()) (which
+ typically contains the class and detail message of cause).
+ @param cause the cause (which is saved for later retrieval by the
+ {@link #getCause()} method). (A null value is
+ permitted, and indicates that the cause is nonexistent or
+ unknown.)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ Group with the given groupname.
+ @param group group name]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ugi.
+ @param ugi user
+ @return the {@link Subject} for the user identified by ugi]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ugi as a comma separated string in
+ conf as a property attr
+
+ The String starts with the user name followed by the default group names,
+ and other group names.
+
+ @param conf configuration
+ @param attr property name
+ @param ugi a UnixUserGroupInformation]]>
+
+
+
+
+
+
+
+ conf
+
+ The object is expected to store with the property name attr
+ as a comma separated string that starts
+ with the user name followed by group names.
+ If the property name is not defined, return null.
+ It's assumed that there is only one UGI per user. If this user already
+ has a UGI in the ugi map, return the ugi in the map.
+ Otherwise, construct a UGI from the configuration, store it in the
+ ugi map and return it.
+
+ @param conf configuration
+ @param attr property name
+ @return a UnixUGI
+ @throws LoginException if the stored string is ill-formatted.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ User with the given username.
+ @param user user name]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ (cause==null ? null : cause.toString()) (which
+ typically contains the class and detail message of cause).
+ @param cause the cause (which is saved for later retrieval by the
+ {@link #getCause()} method). (A null value is
+ permitted, and indicates that the cause is nonexistent or
+ unknown.)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ does not provide the stack trace for security purposes.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ service as related to
+ Service Level Authorization for Hadoop.
+
+ Each service defines it's configuration key and also the necessary
+ {@link Permission} required to access the service.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ in]]>
+
+
+
+
+
+
+ out.]]>
+
+
+
+
+
+
+
+
+
+ reset is true, then resets the checksum.
+ @return number of bytes written. Will be equal to getChecksumSize();]]>
+
+
+
+
+
+
+
+
+ reset is true, then resets the checksum.
+ @return number of bytes written. Will be equal to getChecksumSize();]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ GenericOptionsParser to parse only the generic Hadoop
+ arguments.
+
+ The array of string arguments other than the generic arguments can be
+ obtained by {@link #getRemainingArgs()}.
+
+ @param conf the Configuration to modify.
+ @param args command-line arguments.]]>
+
+
+
+
+ GenericOptionsParser to parse given options as well
+ as generic Hadoop options.
+
+ The resulting CommandLine object can be obtained by
+ {@link #getCommandLine()}.
+
+ @param conf the configuration to modify
+ @param options options built by the caller
+ @param args User-specified arguments]]>
+
+
+
+
+ Strings containing the un-parsed arguments
+ or empty array if commandLine was not defined.]]>
+
+
+
+
+
+
+
+
+
+ CommandLine object
+ to process the parsed arguments.
+
+ Note: If the object is created with
+ {@link #GenericOptionsParser(Configuration, String[])}, then returned
+ object will only contain parsed generic options.
+
+ @return CommandLine representing list of arguments
+ parsed against Options descriptor.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ GenericOptionsParser is a utility to parse command line
+ arguments generic to the Hadoop framework.
+
+ GenericOptionsParser recognizes several standarad command
+ line arguments, enabling applications to easily specify a namenode, a
+ jobtracker, additional configuration resources etc.
+
+
Generic Options
+
+
The supported generic options are:
+
+ -conf <configuration file> specify a configuration file
+ -D <property=value> use value for given property
+ -fs <local|namenode:port> specify a namenode
+ -jt <local|jobtracker:port> specify a job tracker
+ -files <comma separated list of files> specify comma separated
+ files to be copied to the map reduce cluster
+ -libjars <comma separated list of jars> specify comma separated
+ jar files to include in the classpath.
+ -archives <comma separated list of archives> specify comma
+ separated archives to be unarchived on the compute machines.
+
+
Generic command line arguments might modify
+ Configuration objects, given to constructors.
+
+
The functionality is implemented using Commons CLI.
+
+
Examples:
+
+ $ bin/hadoop dfs -fs darwin:8020 -ls /data
+ list /data directory in dfs with namenode darwin:8020
+
+ $ bin/hadoop dfs -D fs.default.name=darwin:8020 -ls /data
+ list /data directory in dfs with namenode darwin:8020
+
+ $ bin/hadoop dfs -conf hadoop-site.xml -ls /data
+ list /data directory in dfs with conf specified in hadoop-site.xml
+
+ $ bin/hadoop job -D mapred.job.tracker=darwin:50020 -submit job.xml
+ submit a job to job tracker darwin:50020
+
+ $ bin/hadoop job -jt darwin:50020 -submit job.xml
+ submit a job to job tracker darwin:50020
+
+ $ bin/hadoop job -jt local -submit job.xml
+ submit a job to local runner
+
+ $ bin/hadoop jar -libjars testlib.jar
+ -archives test.tgz -files file.txt inputjar args
+ job submission with libjars, files and archives
+
+
+ @see Tool
+ @see ToolRunner]]>
+
+
+
+
+
+
+
+
+
+
+ Class<T>) of the
+ argument of type T.
+ @param The type of the argument
+ @param t the object to get it class
+ @return Class<T>]]>
+
+
+
+
+
+
+ List<T> to a an array of
+ T[].
+ @param c the Class object of the items in the list
+ @param list the list to convert]]>
+
+
+
+
+
+ List<T> to a an array of
+ T[].
+ @param list the list to convert
+ @throws ArrayIndexOutOfBoundsException if the list is empty.
+ Use {@link #toArray(Class, List)} if the list may be empty.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ io.file.buffer.size specified in the given
+ Configuration.
+ @param in input stream
+ @param conf configuration
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if native-hadoop is loaded,
+ else false]]>
+
+
+
+
+
+ true if native hadoop libraries, if present, can be
+ used for this job; false otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ { pq.top().change(); pq.adjustTop(); }
+ instead of
+ { o = pq.pop(); o.change(); pq.push(o); }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Clients and/or applications can use the provided Progressable
+ to explicitly report progress to the Hadoop framework. This is especially
+ important for operations which take an insignificant amount of time since,
+ in-lieu of the reported progress, the framework has to assume that an error
+ has occured and time-out the operation.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Class is to be obtained
+ @return the correctly typed Class of the given object.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Hadoop Pipes
+ or Hadoop Streaming.
+
+ It also checks to ensure that we are running on a *nix platform else
+ (e.g. in Cygwin/Windows) it returns null.
+ @param conf configuration
+ @return a String[] with the ulimit command arguments or
+ null if we are running on a non *nix platform or
+ if the limit is unspecified.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Shell interface.
+ @param cmd shell command to execute.
+ @return the output of the executed command.]]>
+
+
+
+
+
+
+
+ Shell interface.
+ @param env the map of environment key=value
+ @param cmd shell command to execute.
+ @return the output of the executed command.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Shell can be used to run unix commands like du or
+ df. It also offers facilities to gate commands by
+ time-intervals.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ShellCommandExecutorshould be used in cases where the output
+ of the command needs no explicit parsing and where the command, working
+ directory and the environment remains unchanged. The output of the command
+ is stored as-is and is expected to be small.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ArrayList of string values]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ charToEscape in the string
+ with the escape char escapeChar
+
+ @param str string
+ @param escapeChar escape char
+ @param charToEscape the char to be escaped
+ @return an escaped string]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ charToEscape in the string
+ with the escape char escapeChar
+
+ @param str string
+ @param escapeChar escape char
+ @param charToEscape the escaped char
+ @return an unescaped string]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Tool, is the standard for any Map-Reduce tool/application.
+ The tool/application should delegate the handling of
+
+ standard command-line options to {@link ToolRunner#run(Tool, String[])}
+ and only handle its custom arguments.
+
+
Here is how a typical Tool is implemented:
+
+ public class MyApp extends Configured implements Tool {
+
+ public int run(String[] args) throws Exception {
+ // Configuration processed by ToolRunner
+ Configuration conf = getConf();
+
+ // Create a JobConf using the processed conf
+ JobConf job = new JobConf(conf, MyApp.class);
+
+ // Process custom command-line options
+ Path in = new Path(args[1]);
+ Path out = new Path(args[2]);
+
+ // Specify various job-specific parameters
+ job.setJobName("my-app");
+ job.setInputPath(in);
+ job.setOutputPath(out);
+ job.setMapperClass(MyApp.MyMapper.class);
+ job.setReducerClass(MyApp.MyReducer.class);
+
+ // Submit the job, then poll for progress until the job is complete
+ JobClient.runJob(job);
+ }
+
+ public static void main(String[] args) throws Exception {
+ // Let ToolRunner handle generic command-line options
+ int res = ToolRunner.run(new Configuration(), new Sort(), args);
+
+ System.exit(res);
+ }
+ }
+
+
+ @see GenericOptionsParser
+ @see ToolRunner]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Tool by {@link Tool#run(String[])}, after
+ parsing with the given generic arguments. Uses the given
+ Configuration, or builds one if null.
+
+ Sets the Tool's configuration with the possibly modified
+ version of the conf.
+
+ @param conf Configuration for the Tool.
+ @param tool Tool to run.
+ @param args command-line arguments to the tool.
+ @return exit code of the {@link Tool#run(String[])} method.]]>
+
+
+
+
+
+
+
+ Tool with its Configuration.
+
+ Equivalent to run(tool.getConf(), tool, args).
+
+ @param tool Tool to run.
+ @param args command-line arguments to the tool.
+ @return exit code of the {@link Tool#run(String[])} method.]]>
+
+
+
+
+
+
+
+
+
+ ToolRunner can be used to run classes implementing
+ Tool interface. It works in conjunction with
+ {@link GenericOptionsParser} to parse the
+
+ generic hadoop command line arguments and modifies the
+ Configuration of the Tool. The
+ application-specific options are passed along without being modified.
+
+
+ @see Tool
+ @see GenericOptionsParser]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ this filter.
+ @param nbHash The number of hash function to consider.
+ @param hashType type of the hashing function (see
+ {@link org.apache.hadoop.util.hash.Hash}).]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Bloom filter, as defined by Bloom in 1970.
+
+ The Bloom filter is a data structure that was introduced in 1970 and that has been adopted by
+ the networking research community in the past decade thanks to the bandwidth efficiencies that it
+ offers for the transmission of set membership information between networked hosts. A sender encodes
+ the information into a bit vector, the Bloom filter, that is more compact than a conventional
+ representation. Computation and space costs for construction are linear in the number of elements.
+ The receiver uses the filter to test whether various elements are members of the set. Though the
+ filter will occasionally return a false positive, it will never return a false negative. When creating
+ the filter, the sender can choose its desired point in a trade-off between the false positive rate and the size.
+
+
+
+
+
+
+
+
+
+
+
+
+
+ this filter.
+ @param nbHash The number of hash function to consider.
+ @param hashType type of the hashing function (see
+ {@link org.apache.hadoop.util.hash.Hash}).]]>
+
+
+
+
+
+
+
+
+ this counting Bloom filter.
+
+ Invariant: nothing happens if the specified key does not belong to this counter Bloom filter.
+ @param key The key to remove.]]>
+
+
+
+
+
+
+
+
+
+
+
+ key -> count map.
+
NOTE: due to the bucket size of this filter, inserting the same
+ key more than 15 times will cause an overflow at all filter positions
+ associated with this key, and it will significantly increase the error
+ rate for this and other keys. For this reason the filter can only be
+ used to store small count values 0 <= N << 15.
+ @param key key to be tested
+ @return 0 if the key is not present. Otherwise, a positive value v will
+ be returned such that v == count with probability equal to the
+ error rate of this filter, and v > count otherwise.
+ Additionally, if the filter experienced an underflow as a result of
+ {@link #delete(Key)} operation, the return value may be lower than the
+ count with the probability of the false negative rate of such
+ filter.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ counting Bloom filter, as defined by Fan et al. in a ToN
+ 2000 paper.
+
+ A counting Bloom filter is an improvement to standard a Bloom filter as it
+ allows dynamic additions and deletions of set membership information. This
+ is achieved through the use of a counting vector instead of a bit vector.
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Builds an empty Dynamic Bloom filter.
+ @param vectorSize The number of bits in the vector.
+ @param nbHash The number of hash function to consider.
+ @param hashType type of the hashing function (see
+ {@link org.apache.hadoop.util.hash.Hash}).
+ @param nr The threshold for the maximum number of keys to record in a
+ dynamic Bloom filter row.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ dynamic Bloom filter, as defined in the INFOCOM 2006 paper.
+
+ A dynamic Bloom filter (DBF) makes use of a s * m bit matrix but
+ each of the s rows is a standard Bloom filter. The creation
+ process of a DBF is iterative. At the start, the DBF is a 1 * m
+ bit matrix, i.e., it is composed of a single standard Bloom filter.
+ It assumes that nr elements are recorded in the
+ initial bit vector, where nr <= n (n is
+ the cardinality of the set A to record in the filter).
+
+ As the size of A grows during the execution of the application,
+ several keys must be inserted in the DBF. When inserting a key into the DBF,
+ one must first get an active Bloom filter in the matrix. A Bloom filter is
+ active when the number of recorded keys, nr, is
+ strictly less than the current cardinality of A, n.
+ If an active Bloom filter is found, the key is inserted and
+ nr is incremented by one. On the other hand, if there
+ is no active Bloom filter, a new one is created (i.e., a new row is added to
+ the matrix) according to the current size of A and the element
+ is added in this new Bloom filter and the nr value of
+ this new Bloom filter is set to one. A given key is said to belong to the
+ DBF if the k positions are set to one in one of the matrix rows.
+
+
+
+
+
+
+
+
+
+
+ this filter.
+ @param nbHash The number of hash functions to consider.
+ @param hashType type of the hashing function (see {@link Hash}).]]>
+
+
+
+
+
+ this filter.
+ @param key The key to add.]]>
+
+
+
+
+
+ this filter.
+ @param key The key to test.
+ @return boolean True if the specified key belongs to this filter.
+ False otherwise.]]>
+
+
+
+
+
+ this filter and a specified filter.
+
+ Invariant: The result is assigned to this filter.
+ @param filter The filter to AND with.]]>
+
+
+
+
+
+ this filter and a specified filter.
+
+ Invariant: The result is assigned to this filter.
+ @param filter The filter to OR with.]]>
+
+
+
+
+
+ this filter and a specified filter.
+
+ Invariant: The result is assigned to this filter.
+ @param filter The filter to XOR with.]]>
+
+
+
+
+ this filter.
+
+ The result is assigned to this filter.]]>
+
+
+
+
+
+ this filter.
+ @param keys The list of keys.]]>
+
+
+
+
+
+ this filter.
+ @param keys The collection of keys.]]>
+
+
+
+
+
+ this filter.
+ @param keys The array of keys.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ this filter.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A filter is a data structure which aims at offering a lossy summary of a set A. The
+ key idea is to map entries of A (also called keys) into several positions
+ in a vector through the use of several hash functions.
+
+ Typically, a filter will be implemented as a Bloom filter (or a Bloom filter extension).
+
+ It must be extended in order to define the real behavior.
+
+ @see Key The general behavior of a key
+ @see HashFunction A hash function]]>
+
+
+
+
+
+
+
+
+ Builds a hash function that must obey to a given maximum number of returned values and a highest value.
+ @param maxValue The maximum highest returned value.
+ @param nbHash The number of resulting hashed values.
+ @param hashType type of the hashing function (see {@link Hash}).]]>
+
+
+
+
+ this hash function. A NOOP]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Builds a key with a default weight.
+ @param value The byte value of this key.]]>
+
+
+
+
+
+ Builds a key with a specified weight.
+ @param value The value of this key.
+ @param weight The weight associated to this key.]]>
+
+
+
+
+
+
+
+
+
+
+
+ this key.]]>
+
+
+
+
+ this key.]]>
+
+
+
+
+
+ this key with a specified value.
+ @param weight The increment.]]>
+
+
+
+
+ this key by one.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The idea is to randomly select a bit to reset.]]>
+
+
+
+
+
+ The idea is to select the bit to reset that will generate the minimum
+ number of false negative.]]>
+
+
+
+
+
+ The idea is to select the bit to reset that will remove the maximum number
+ of false positive.]]>
+
+
+
+
+
+ The idea is to select the bit to reset that will, at the same time, remove
+ the maximum number of false positve while minimizing the amount of false
+ negative generated.]]>
+
+
+
+
+ Originally created by
+ European Commission One-Lab Project 034819.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ this filter.
+ @param nbHash The number of hash function to consider.
+ @param hashType type of the hashing function (see
+ {@link org.apache.hadoop.util.hash.Hash}).]]>
+
+
+
+
+
+
+
+
+ this retouched Bloom filter.
+
+ Invariant: if the false positive is null, nothing happens.
+ @param key The false positive key to add.]]>
+
+
+
+
+
+ this retouched Bloom filter.
+ @param coll The collection of false positive.]]>
+
+
+
+
+
+ this retouched Bloom filter.
+ @param keys The list of false positive.]]>
+
+
+
+
+
+ this retouched Bloom filter.
+ @param keys The array of false positive.]]>
+
+
+
+
+
+
+ this retouched Bloom filter.
+ @param scheme The selective clearing scheme to apply.]]>
+
+
+
+
+
+
+
+
+
+
+
+ retouched Bloom filter, as defined in the CoNEXT 2006 paper.
+
+ It allows the removal of selected false positives at the cost of introducing
+ random false negatives, and with the benefit of eliminating some random false
+ positives at the same time.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ length, and
+ the provided seed value
+ @param bytes input bytes
+ @param length length of the valid bytes to consider
+ @param initval seed value
+ @return hash value]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The best hash table sizes are powers of 2. There is no need to do mod
+ a prime (mod is sooo slow!). If you need less than 32 bits, use a bitmask.
+ For example, if you need only 10 bits, do
+ h = (h & hashmask(10));
+ In which case, the hash table should have hashsize(10) elements.
+
+
If you are hashing n strings byte[][] k, do it like this:
+ for (int i = 0, h = 0; i < n; ++i) h = hash( k[i], h);
+
+
By Bob Jenkins, 2006. bob_jenkins@burtleburtle.net. You may use this
+ code any way you wish, private, educational, or commercial. It's free.
+
+
Use for hash table lookup, or anything where one collision in 2^^32 is
+ acceptable. Do NOT use for cryptographic purposes.]]>
+
+
+
+
+
+
+
+
+
+
+ lookup3.c, by Bob Jenkins, May 2006, Public Domain.
+
+ You can use this free for any purpose. It's in the public domain.
+ It has no warranty.
+
+
+ @see lookup3.c
+ @see Hash Functions (and how this
+ function compares to others such as CRC, MD?, etc
+ @see Has update on the
+ Dr. Dobbs Article]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The C version of MurmurHash 2.0 found at that site was ported
+ to Java by Andrzej Bialecki (ab at getopt org).]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ JobTracker,
+ as {@link JobTracker.State}
+
+ @return the current state of the JobTracker.]]>
+
+
+
+
+ JobTracker
+
+ @return the size of heap memory used by the JobTracker]]>
+
+
+
+
+ JobTracker
+
+ @return the configured size of max heap memory that can be used by the JobTracker]]>
+
+
+
+
+
+
+
+
+
+
+
+ ClusterStatus provides clients with information such as:
+
+
+ Size of the cluster.
+
+
+ Name of the trackers.
+
+
+ Task capacity of the cluster.
+
+
+ The number of currently running map & reduce tasks.
+
+
+ State of the JobTracker.
+
+
+
+
Clients can query for the latest ClusterStatus, via
+ {@link JobClient#getClusterStatus()}.
Counters are bunched into {@link Group}s, each comprising of
+ counters from a particular Enum class.
+ @deprecated Use {@link org.apache.hadoop.mapreduce.Counters} instead.]]>
+
Grouphandles localization of the class name and the
+ counter names.
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FileInputFormat implementations can override this and return
+ false to ensure that individual input files are never split-up
+ so that {@link Mapper}s process entire files.
+
+ @param fs the file system that the file is on
+ @param filename the file name to check
+ @return is this file splitable?]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FileInputFormat is the base class for all file-based
+ InputFormats. This provides a generic implementation of
+ {@link #getSplits(JobConf, int)}.
+ Subclasses of FileInputFormat can also override the
+ {@link #isSplitable(FileSystem, Path)} method to ensure input-files are
+ not split-up and are processed as a whole by {@link Mapper}s.
+ @deprecated Use {@link org.apache.hadoop.mapreduce.lib.input.FileInputFormat}
+ instead.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the job output should be compressed,
+ false otherwise]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Tasks' Side-Effect Files
+
+
Note: The following is valid only if the {@link OutputCommitter}
+ is {@link FileOutputCommitter}. If OutputCommitter is not
+ a FileOutputCommitter, the task's temporary output
+ directory is same as {@link #getOutputPath(JobConf)} i.e.
+ ${mapred.output.dir}$
+
+
Some applications need to create/write-to side-files, which differ from
+ the actual job-outputs.
+
+
In such cases there could be issues with 2 instances of the same TIP
+ (running simultaneously e.g. speculative tasks) trying to open/write-to the
+ same file (path) on HDFS. Hence the application-writer will have to pick
+ unique names per task-attempt (e.g. using the attemptid, say
+ attempt_200709221812_0001_m_000000_0), not just per TIP.
+
+
To get around this the Map-Reduce framework helps the application-writer
+ out by maintaining a special
+ ${mapred.output.dir}/_temporary/_${taskid}
+ sub-directory for each task-attempt on HDFS where the output of the
+ task-attempt goes. On successful completion of the task-attempt the files
+ in the ${mapred.output.dir}/_temporary/_${taskid} (only)
+ are promoted to ${mapred.output.dir}. Of course, the
+ framework discards the sub-directory of unsuccessful task-attempts. This
+ is completely transparent to the application.
+
+
The application-writer can take advantage of this by creating any
+ side-files required in ${mapred.work.output.dir} during execution
+ of his reduce-task i.e. via {@link #getWorkOutputPath(JobConf)}, and the
+ framework will move them out similarly - thus she doesn't have to pick
+ unique paths per task-attempt.
+
+
Note: the value of ${mapred.work.output.dir} during
+ execution of a particular task-attempt is actually
+ ${mapred.output.dir}/_temporary/_{$taskid}, and this value is
+ set by the map-reduce framework. So, just create any side-files in the
+ path returned by {@link #getWorkOutputPath(JobConf)} from map/reduce
+ task to take advantage of this feature.
+
+
The entire discussion holds true for maps of jobs with
+ reducer=NONE (i.e. 0 reduces) since output of the map, in that case,
+ goes directly to HDFS.
+
+ @return the {@link Path} to the task's temporary output directory
+ for the map-reduce job.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The generated name can be used to create custom files from within the
+ different tasks for the job, the names for different tasks will not collide
+ with each other.
+
+
The given name is postfixed with the task type, 'm' for maps, 'r' for
+ reduces and the task partition number. For example, give a name 'test'
+ running on the first map o the job the generated name will be
+ 'test-m-00000'.
+
+ @param conf the configuration for the job.
+ @param name the name to make unique.
+ @return a unique name accross all tasks of the job.]]>
+
+
+
+
+
+
+ The path can be used to create custom files from within the map and
+ reduce tasks. The path name will be unique for each task. The path parent
+ will be the job output directory.ls
+
+
This method uses the {@link #getUniqueName} method to make the file name
+ unique for the task.
+
+ @param conf the configuration for the job.
+ @param name the name for the file.
+ @return a unique path accross all tasks of the job.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Each {@link InputSplit} is then assigned to an individual {@link Mapper}
+ for processing.
+
+
Note: The split is a logical split of the inputs and the
+ input files are not physically split into chunks. For e.g. a split could
+ be <input-file-path, start, offset> tuple.
+
+ @param job job configuration.
+ @param numSplits the desired number of splits, a hint.
+ @return an array of {@link InputSplit}s for the job.]]>
+
+
+
+
+
+
+
+
+ It is the responsibility of the RecordReader to respect
+ record boundaries while processing the logical split to present a
+ record-oriented view to the individual task.
+
+ @param split the {@link InputSplit}
+ @param job the job that this split belongs to
+ @return a {@link RecordReader}]]>
+
+
+
+ InputFormat describes the input-specification for a
+ Map-Reduce job.
+
+
The Map-Reduce framework relies on the InputFormat of the
+ job to:
+
+
+ Validate the input-specification of the job.
+
+ Split-up the input file(s) into logical {@link InputSplit}s, each of
+ which is then assigned to an individual {@link Mapper}.
+
+
+ Provide the {@link RecordReader} implementation to be used to glean
+ input records from the logical InputSplit for processing by
+ the {@link Mapper}.
+
+
+
+
The default behavior of file-based {@link InputFormat}s, typically
+ sub-classes of {@link FileInputFormat}, is to split the
+ input into logical {@link InputSplit}s based on the total size, in
+ bytes, of the input files. However, the {@link FileSystem} blocksize of
+ the input files is treated as an upper bound for input splits. A lower bound
+ on the split size can be set via
+
+ mapred.min.split.size.
+
+
Clearly, logical splits based on input-size is insufficient for many
+ applications since record boundaries are to respected. In such cases, the
+ application has to also implement a {@link RecordReader} on whom lies the
+ responsibilty to respect record-boundaries and present a record-oriented
+ view of the logical InputSplit to the individual task.
+
+ @see InputSplit
+ @see RecordReader
+ @see JobClient
+ @see FileInputFormat
+ @deprecated Use {@link org.apache.hadoop.mapreduce.InputFormat} instead.]]>
+
+
+
+
+
+
+
+
+
+ InputSplit.
+
+ @return the number of bytes in the input split.
+ @throws IOException]]>
+
+
+
+
+
+ InputSplit is
+ located as an array of Strings.
+ @throws IOException]]>
+
+
+
+ InputSplit represents the data to be processed by an
+ individual {@link Mapper}.
+
+
Typically, it presents a byte-oriented view on the input and is the
+ responsibility of {@link RecordReader} of the job to process this and present
+ a record-oriented view.
+
+ @see InputFormat
+ @see RecordReader
+ @deprecated Use {@link org.apache.hadoop.mapreduce.InputSplit} instead.]]>
+
+ Checking the input and output specifications of the job.
+
+
+ Computing the {@link InputSplit}s for the job.
+
+
+ Setup the requisite accounting information for the {@link DistributedCache}
+ of the job, if necessary.
+
+
+ Copying the job's jar and configuration to the map-reduce system directory
+ on the distributed file-system.
+
+
+ Submitting the job to the JobTracker and optionally monitoring
+ it's status.
+
+
+
+ Normally the user creates the application, describes various facets of the
+ job via {@link JobConf} and then uses the JobClient to submit
+ the job and monitor its progress.
+
+
Here is an example on how to use JobClient:
+
+ // Create a new JobConf
+ JobConf job = new JobConf(new Configuration(), MyJob.class);
+
+ // Specify various job-specific parameters
+ job.setJobName("myjob");
+
+ job.setInputPath(new Path("in"));
+ job.setOutputPath(new Path("out"));
+
+ job.setMapperClass(MyJob.MyMapper.class);
+ job.setReducerClass(MyJob.MyReducer.class);
+
+ // Submit the job, then poll for progress until the job is complete
+ JobClient.runJob(job);
+
+
+
Job Control
+
+
At times clients would chain map-reduce jobs to accomplish complex tasks
+ which cannot be done via a single map-reduce job. This is fairly easy since
+ the output of the job, typically, goes to distributed file-system and that
+ can be used as the input for the next job.
+
+
However, this also means that the onus on ensuring jobs are complete
+ (success/failure) lies squarely on the clients. In such situations the
+ various job-control options are:
+
+
+ {@link #runJob(JobConf)} : submits the job and returns only after
+ the job has completed.
+
+
+ {@link #submitJob(JobConf)} : only submits the job, then poll the
+ returned handle to the {@link RunningJob} to query status and make
+ scheduling decisions.
+
+
+ {@link JobConf#setJobEndNotificationURI(String)} : setup a notification
+ on job-completion, thus avoiding polling.
+
+
+
+ @see JobConf
+ @see ClusterStatus
+ @see Tool
+ @see DistributedCache]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ If the parameter {@code loadDefaults} is false, the new instance
+ will not load resources from the default files.
+
+ @param loadDefaults specifies whether to load from the default files]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if framework should keep the intermediate files
+ for failed tasks, false otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the outputs of the maps are to be compressed,
+ false otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This comparator should be provided if the equivalence rules for keys
+ for sorting the intermediates are different from those for grouping keys
+ before each call to
+ {@link Reducer#reduce(Object, java.util.Iterator, OutputCollector, Reporter)}.
+
+
For key-value pairs (K1,V1) and (K2,V2), the values (V1, V2) are passed
+ in a single call to the reduce function if K1 and K2 compare as equal.
+
+
Since {@link #setOutputKeyComparatorClass(Class)} can be used to control
+ how keys are sorted, this can be used in conjunction to simulate
+ secondary sort on values.
+
+
Note: This is not a guarantee of the reduce sort being
+ stable in any sense. (In any case, with the order of available
+ map-outputs to the reduce being non-deterministic, it wouldn't make
+ that much sense.)
+
+ @param theClass the comparator class to be used for grouping keys.
+ It should implement RawComparator.
+ @see #setOutputKeyComparatorClass(Class)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ combiner class used to combine map-outputs
+ before being sent to the reducers. Typically the combiner is same as the
+ the {@link Reducer} for the job i.e. {@link #getReducerClass()}.
+
+ @return the user-defined combiner class used to combine map-outputs.]]>
+
+
+
+
+
+ combiner class used to combine map-outputs
+ before being sent to the reducers.
+
+
The combiner is an application-specified aggregation operation, which
+ can help cut down the amount of data transferred between the
+ {@link Mapper} and the {@link Reducer}, leading to better performance.
+
+
The framework may invoke the combiner 0, 1, or multiple times, in both
+ the mapper and reducer tasks. In general, the combiner is called as the
+ sort/merge result is written to disk. The combiner must:
+
+
be side-effect free
+
have the same input and output key types and the same input and
+ output value types
+
+
+
Typically the combiner is same as the Reducer for the
+ job i.e. {@link #setReducerClass(Class)}.
+
+ @param theClass the user-defined combiner class used to combine
+ map-outputs.]]>
+
+
+
+
+ true.
+
+ @return true if speculative execution be used for this job,
+ false otherwise.]]>
+
+
+
+
+
+ true if speculative execution
+ should be turned on, else false.]]>
+
+
+
+
+ true.
+
+ @return true if speculative execution be
+ used for this job for map tasks,
+ false otherwise.]]>
+
+
+
+
+
+ true if speculative execution
+ should be turned on for map tasks,
+ else false.]]>
+
+
+
+
+ true.
+
+ @return true if speculative execution be used
+ for reduce tasks for this job,
+ false otherwise.]]>
+
+
+
+
+
+ true if speculative execution
+ should be turned on for reduce tasks,
+ else false.]]>
+
+
+
+
+ 1.
+
+ @return the number of reduce tasks for this job.]]>
+
+
+
+
+
+ Note: This is only a hint to the framework. The actual
+ number of spawned map tasks depends on the number of {@link InputSplit}s
+ generated by the job's {@link InputFormat#getSplits(JobConf, int)}.
+
+ A custom {@link InputFormat} is typically used to accurately control
+ the number of map tasks for the job.
+
+
How many maps?
+
+
The number of maps is usually driven by the total size of the inputs
+ i.e. total number of blocks of the input files.
+
+
The right level of parallelism for maps seems to be around 10-100 maps
+ per-node, although it has been set up to 300 or so for very cpu-light map
+ tasks. Task setup takes awhile, so it is best if the maps take at least a
+ minute to execute.
+
+
The default behavior of file-based {@link InputFormat}s is to split the
+ input into logical {@link InputSplit}s based on the total size, in
+ bytes, of input files. However, the {@link FileSystem} blocksize of the
+ input files is treated as an upper bound for input splits. A lower bound
+ on the split size can be set via
+
+ mapred.min.split.size.
+
+
Thus, if you expect 10TB of input data and have a blocksize of 128MB,
+ you'll end up with 82,000 maps, unless {@link #setNumMapTasks(int)} is
+ used to set it even higher.
+
+ @param n the number of map tasks for this job.
+ @see InputFormat#getSplits(JobConf, int)
+ @see FileInputFormat
+ @see FileSystem#getDefaultBlockSize()
+ @see FileStatus#getBlockSize()]]>
+
+
+
+
+ 1.
+
+ @return the number of reduce tasks for this job.]]>
+
+
+
+
+
+ How many reduces?
+
+
With 0.95 all of the reduces can launch immediately and
+ start transfering map outputs as the maps finish. With 1.75
+ the faster nodes will finish their first round of reduces and launch a
+ second wave of reduces doing a much better job of load balancing.
+
+
Increasing the number of reduces increases the framework overhead, but
+ increases load balancing and lowers the cost of failures.
+
+
The scaling factors above are slightly less than whole numbers to
+ reserve a few reduce slots in the framework for speculative-tasks, failures
+ etc.
+
+
Reducer NONE
+
+
It is legal to set the number of reduce-tasks to zero.
+
+
In this case the output of the map-tasks directly go to distributed
+ file-system, to the path set by
+ {@link FileOutputFormat#setOutputPath(JobConf, Path)}. Also, the
+ framework doesn't sort the map-outputs before writing it out to HDFS.
+
+ @param n the number of reduce tasks for this job.]]>
+
+
+
+
+ mapred.map.max.attempts
+ property. If this property is not already set, the default is 4 attempts.
+
+ @return the max number of attempts per map task.]]>
+
+
+
+
+
+
+
+
+
+
+ mapred.reduce.max.attempts
+ property. If this property is not already set, the default is 4 attempts.
+
+ @return the max number of attempts per reduce task.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ noFailures, the
+ tasktracker is blacklisted for this job.
+
+ @param noFailures maximum no. of failures of a given job per tasktracker.]]>
+
+
+
+
+ blacklisted for this job.
+
+ @return the maximum no. of failures of a given job per tasktracker.]]>
+
+
+
+
+ failed.
+
+ Defaults to zero, i.e. any failed map-task results in
+ the job being declared as {@link JobStatus#FAILED}.
+
+ @return the maximum percentage of map tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+
+ failed.
+
+ @param percent the maximum percentage of map tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+ failed.
+
+ Defaults to zero, i.e. any failed reduce-task results
+ in the job being declared as {@link JobStatus#FAILED}.
+
+ @return the maximum percentage of reduce tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+
+ failed.
+
+ @param percent the maximum percentage of reduce tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The debug script can aid debugging of failed map tasks. The script is
+ given task's stdout, stderr, syslog, jobconf files as arguments.
+
+
The debug command, run on the node where the map failed, is:
+
+ $script $stdout $stderr $syslog $jobconf.
+
+
+
The script file is distributed through {@link DistributedCache}
+ APIs. The script needs to be symlinked.
+
+ @param mDbgScript the script name]]>
+
+
+
+
+
+
+
+
+
+
+ The debug script can aid debugging of failed reduce tasks. The script
+ is given task's stdout, stderr, syslog, jobconf files as arguments.
+
+
The debug command, run on the node where the map failed, is:
+
+ $script $stdout $stderr $syslog $jobconf.
+
+
+
The script file is distributed through {@link DistributedCache}
+ APIs. The script file needs to be symlinked
+
+ @param rDbgScript the script name]]>
+
+
+
+
+
+
+
+
+
+ null if it hasn't
+ been set.
+ @see #setJobEndNotificationURI(String)]]>
+
+
+
+
+
+ The uri can contain 2 special parameters: $jobId and
+ $jobStatus. Those, if present, are replaced by the job's
+ identifier and completion-status respectively.
+
+
This is typically used by application-writers to implement chaining of
+ Map-Reduce jobs in an asynchronous manner.
+
+ @param uri the job end notification uri
+ @see JobStatus
+ @see Job Completion and Chaining]]>
+
+
+
+
+
+ When a job starts, a shared directory is created at location
+
+ ${mapred.local.dir}/taskTracker/jobcache/$jobid/work/ .
+ This directory is exposed to the users through
+ job.local.dir .
+ So, the tasks can use this space
+ as scratch space and share files among them.
+ This value is available as System property also.
+
+ @return The localized job specific shared directory]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ mapred.task.maxvmem is split into
+ mapred.job.map.memory.mb
+ and mapred.job.map.memory.mb,mapred
+ each of the new key are set
+ as mapred.task.maxvmem / 1024
+ as new values are in MB
+
+ @return The maximum amount of memory any task of this job will use, in
+ bytes.
+ @see #setMaxVirtualMemoryForTask(long)
+ @deprecated Use {@link #getMemoryForMapTask()} and
+ {@link #getMemoryForReduceTask()}]]>
+
+
+
+
+
+
+ mapred.task.maxvmem is split into
+ mapred.job.map.memory.mb
+ and mapred.job.map.memory.mb,mapred
+ each of the new key are set
+ as mapred.task.maxvmem / 1024
+ as new values are in MB
+
+ @param vmem Maximum amount of virtual memory in bytes any task of this job
+ can use.
+ @see #getMaxVirtualMemoryForTask()
+ @deprecated
+ Use {@link #setMemoryForMapTask(long mem)} and
+ Use {@link #setMemoryForReduceTask(long mem)}]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ JobConf is the primary interface for a user to describe a
+ map-reduce job to the Hadoop framework for execution. The framework tries to
+ faithfully execute the job as-is described by JobConf, however:
+
+
+ Some configuration parameters might have been marked as
+
+ final by administrators and hence cannot be altered.
+
+
+ While some job parameters are straight-forward to set
+ (e.g. {@link #setNumReduceTasks(int)}), some parameters interact subtly
+ rest of the framework and/or job-configuration and is relatively more
+ complex for the user to control finely (e.g. {@link #setNumMapTasks(int)}).
+
+
+
+
JobConf typically specifies the {@link Mapper}, combiner
+ (if any), {@link Partitioner}, {@link Reducer}, {@link InputFormat} and
+ {@link OutputFormat} implementations to be used etc.
+
+
Optionally JobConf is used to specify other advanced facets
+ of the job such as Comparators to be used, files to be put in
+ the {@link DistributedCache}, whether or not intermediate and/or job outputs
+ are to be compressed (and how), debugability via user-provided scripts
+ ( {@link #setMapDebugScript(String)}/{@link #setReduceDebugScript(String)}),
+ for doing post-processing on task logs, task's stdout, stderr, syslog.
+ and etc.
+
+
Here is an example on how to configure a job via JobConf:
+
+ // Create a new JobConf
+ JobConf job = new JobConf(new Configuration(), MyJob.class);
+
+ // Specify various job-specific parameters
+ job.setJobName("myjob");
+
+ FileInputFormat.setInputPaths(job, new Path("in"));
+ FileOutputFormat.setOutputPath(job, new Path("out"));
+
+ job.setMapperClass(MyJob.MyMapper.class);
+ job.setCombinerClass(MyJob.MyReducer.class);
+ job.setReducerClass(MyJob.MyReducer.class);
+
+ job.setInputFormat(SequenceFileInputFormat.class);
+ job.setOutputFormat(SequenceFileOutputFormat.class);
+
+ @param jtIdentifier jobTracker identifier, or null
+ @param jobId job number, or null
+ @return a regex pattern matching JobIDs]]>
+
+
+
+
+ An example JobID is :
+ job_200707121733_0003 , which represents the third job
+ running at the jobtracker started at 200707121733.
+
+ Applications should never construct or parse JobID strings, but rather
+ use appropriate constructors or {@link #forName(String)} method.
+
+ @see TaskID
+ @see TaskAttemptID]]>
+
Applications can use the {@link Reporter} provided to report progress
+ or just indicate that they are alive. In scenarios where the application
+ takes an insignificant amount of time to process individual key/value
+ pairs, this is crucial since the framework might assume that the task has
+ timed-out and kill that task. The other way of avoiding this is to set
+
+ mapred.task.timeout to a high-enough value (or even zero for no
+ time-outs).
+
+ @param key the input key.
+ @param value the input value.
+ @param output collects mapped keys and values.
+ @param reporter facility to report progress.]]>
+
+
+
+ Maps are the individual tasks which transform input records into a
+ intermediate records. The transformed intermediate records need not be of
+ the same type as the input records. A given input pair may map to zero or
+ many output pairs.
+
+
The Hadoop Map-Reduce framework spawns one map task for each
+ {@link InputSplit} generated by the {@link InputFormat} for the job.
+ Mapper implementations can access the {@link JobConf} for the
+ job via the {@link JobConfigurable#configure(JobConf)} and initialize
+ themselves. Similarly they can use the {@link Closeable#close()} method for
+ de-initialization.
+
+
The framework then calls
+ {@link #map(Object, Object, OutputCollector, Reporter)}
+ for each key/value pair in the InputSplit for that task.
+
+
All intermediate values associated with a given output key are
+ subsequently grouped by the framework, and passed to a {@link Reducer} to
+ determine the final output. Users can control the grouping by specifying
+ a Comparator via
+ {@link JobConf#setOutputKeyComparatorClass(Class)}.
+
+
The grouped Mapper outputs are partitioned per
+ Reducer. Users can control which keys (and hence records) go to
+ which Reducer by implementing a custom {@link Partitioner}.
+
+
Users can optionally specify a combiner, via
+ {@link JobConf#setCombinerClass(Class)}, to perform local aggregation of the
+ intermediate outputs, which helps to cut down the amount of data transferred
+ from the Mapper to the Reducer.
+
+
The intermediate, grouped outputs are always stored in
+ {@link SequenceFile}s. Applications can specify if and how the intermediate
+ outputs are to be compressed and which {@link CompressionCodec}s are to be
+ used via the JobConf.
+
+
If the job has
+ zero
+ reduces then the output of the Mapper is directly written
+ to the {@link FileSystem} without grouping by keys.
+
+
Example:
+
+ public class MyMapper<K extends WritableComparable, V extends Writable>
+ extends MapReduceBase implements Mapper<K, V, K, V> {
+
+ static enum MyCounters { NUM_RECORDS }
+
+ private String mapTaskId;
+ private String inputFile;
+ private int noRecords = 0;
+
+ public void configure(JobConf job) {
+ mapTaskId = job.get("mapred.task.id");
+ inputFile = job.get("map.input.file");
+ }
+
+ public void map(K key, V val,
+ OutputCollector<K, V> output, Reporter reporter)
+ throws IOException {
+ // Process the <key, value> pair (assume this takes a while)
+ // ...
+ // ...
+
+ // Let the framework know that we are alive, and kicking!
+ // reporter.progress();
+
+ // Process some more
+ // ...
+ // ...
+
+ // Increment the no. of <key, value> pairs processed
+ ++noRecords;
+
+ // Increment counters
+ reporter.incrCounter(NUM_RECORDS, 1);
+
+ // Every 100 records update application-level status
+ if ((noRecords%100) == 0) {
+ reporter.setStatus(mapTaskId + " processed " + noRecords +
+ " from input-file: " + inputFile);
+ }
+
+ // Output the result
+ output.collect(key, val);
+ }
+ }
+
+
+
Applications may write a custom {@link MapRunnable} to exert greater
+ control on map processing e.g. multi-threaded Mappers etc.
Mapping of input records to output records is complete when this method
+ returns.
+
+ @param input the {@link RecordReader} to read the input records.
+ @param output the {@link OutputCollector} to collect the outputrecords.
+ @param reporter {@link Reporter} to report progress, status-updates etc.
+ @throws IOException]]>
+
+
+
+ Custom implementations of MapRunnable can exert greater
+ control on map processing e.g. multi-threaded, asynchronous mappers etc.
+
+ @see Mapper
+ @deprecated Use {@link org.apache.hadoop.mapreduce.Mapper} instead.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ nearly
+ equal content length.
+ Subclasses implement {@link #getRecordReader(InputSplit, JobConf, Reporter)}
+ to construct RecordReader's for MultiFileSplit's.
+ @see MultiFileSplit
+ @deprecated Use {@link org.apache.hadoop.mapred.lib.CombineFileInputFormat} instead]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ MultiFileSplit can be used to implement {@link RecordReader}'s, with
+ reading one record per file.
+ @see FileSplit
+ @see MultiFileInputFormat
+ @deprecated Use {@link org.apache.hadoop.mapred.lib.CombineFileSplit} instead]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ <key, value> pairs output by {@link Mapper}s
+ and {@link Reducer}s.
+
+
OutputCollector is the generalization of the facility
+ provided by the Map-Reduce framework to collect data output by either the
+ Mapper or the Reducer i.e. intermediate outputs
+ or the output of the job.
The Map-Reduce framework relies on the OutputCommitter of
+ the job to:
+
+
+ Setup the job during initialization. For example, create the temporary
+ output directory for the job during the initialization of the job.
+
+
+ Cleanup the job after the job completion. For example, remove the
+ temporary output directory after the job completion.
+
+
+ Setup the task temporary output.
+
+
+ Check whether a task needs a commit. This is to avoid the commit
+ procedure if a task does not need commit.
+
+
+ Commit of the task output.
+
+
+ Discard the task commit.
+
+
+
+ @see FileOutputCommitter
+ @see JobContext
+ @see TaskAttemptContext
+ @deprecated Use {@link org.apache.hadoop.mapreduce.OutputCommitter} instead.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This is to validate the output specification for the job when it is
+ a job is submitted. Typically checks that it does not already exist,
+ throwing an exception when it already exists, so that output is not
+ overwritten.
+
+ @param ignored
+ @param job job configuration.
+ @throws IOException when output should not be attempted]]>
+
+
+
+ OutputFormat describes the output-specification for a
+ Map-Reduce job.
+
+
The Map-Reduce framework relies on the OutputFormat of the
+ job to:
+
+
+ Validate the output-specification of the job. For e.g. check that the
+ output directory doesn't already exist.
+
+ Provide the {@link RecordWriter} implementation to be used to write out
+ the output files of the job. Output files are stored in a
+ {@link FileSystem}.
+
+
+
+ @see RecordWriter
+ @see JobConf
+ @deprecated Use {@link org.apache.hadoop.mapreduce.OutputFormat} instead.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Typically a hash function on a all or a subset of the key.
+
+ @param key the key to be paritioned.
+ @param value the entry value.
+ @param numPartitions the total number of partitions.
+ @return the partition number for the key.]]>
+
+
+
+ Partitioner controls the partitioning of the keys of the
+ intermediate map-outputs. The key (or a subset of the key) is used to derive
+ the partition, typically by a hash function. The total number of partitions
+ is the same as the number of reduce tasks for the job. Hence this controls
+ which of the m reduce tasks the intermediate key (and hence the
+ record) is sent for reduction.
+
+ @see Reducer
+ @deprecated Use {@link org.apache.hadoop.mapreduce.Partitioner} instead.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if there exists a key/value,
+ false otherwise.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ RawKeyValueIterator is an iterator used to iterate over
+ the raw keys and values during sort/merge of intermediate data.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ 0.0 to 1.0.
+ @throws IOException]]>
+
+
+
+ RecordReader reads <key, value> pairs from an
+ {@link InputSplit}.
+
+
RecordReader, typically, converts the byte-oriented view of
+ the input, provided by the InputSplit, and presents a
+ record-oriented view for the {@link Mapper} & {@link Reducer} tasks for
+ processing. It thus assumes the responsibility of processing record
+ boundaries and presenting the tasks with keys and values.
RecordWriter implementations write the job outputs to the
+ {@link FileSystem}.
+
+ @see OutputFormat]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Reduces values for a given key.
+
+
The framework calls this method for each
+ <key, (list of values)> pair in the grouped inputs.
+ Output values must be of the same type as input values. Input keys must
+ not be altered. The framework will reuse the key and value objects
+ that are passed into the reduce, therefore the application should clone
+ the objects they want to keep a copy of. In many cases, all values are
+ combined into zero or one value.
+
+
+
Output pairs are collected with calls to
+ {@link OutputCollector#collect(Object,Object)}.
+
+
Applications can use the {@link Reporter} provided to report progress
+ or just indicate that they are alive. In scenarios where the application
+ takes an insignificant amount of time to process individual key/value
+ pairs, this is crucial since the framework might assume that the task has
+ timed-out and kill that task. The other way of avoiding this is to set
+
+ mapred.task.timeout to a high-enough value (or even zero for no
+ time-outs).
+
+ @param key the key.
+ @param values the list of values to reduce.
+ @param output to collect keys and combined values.
+ @param reporter facility to report progress.]]>
+
+
+
+ The number of Reducers for the job is set by the user via
+ {@link JobConf#setNumReduceTasks(int)}. Reducer implementations
+ can access the {@link JobConf} for the job via the
+ {@link JobConfigurable#configure(JobConf)} method and initialize themselves.
+ Similarly they can use the {@link Closeable#close()} method for
+ de-initialization.
+
+
Reducer has 3 primary phases:
+
+
+
+
Shuffle
+
+
Reducer is input the grouped output of a {@link Mapper}.
+ In the phase the framework, for each Reducer, fetches the
+ relevant partition of the output of all the Mappers, via HTTP.
+
+
+
+
+
Sort
+
+
The framework groups Reducer inputs by keys
+ (since different Mappers may have output the same key) in this
+ stage.
+
+
The shuffle and sort phases occur simultaneously i.e. while outputs are
+ being fetched they are merged.
+
+
SecondarySort
+
+
If equivalence rules for keys while grouping the intermediates are
+ different from those for grouping keys before reduction, then one may
+ specify a Comparator via
+ {@link JobConf#setOutputValueGroupingComparator(Class)}.Since
+ {@link JobConf#setOutputKeyComparatorClass(Class)} can be used to
+ control how intermediate keys are grouped, these can be used in conjunction
+ to simulate secondary sort on values.
+
+
+ For example, say that you want to find duplicate web pages and tag them
+ all with the url of the "best" known example. You would set up the job
+ like:
+
+
Map Input Key: url
+
Map Input Value: document
+
Map Output Key: document checksum, url pagerank
+
Map Output Value: url
+
Partitioner: by checksum
+
OutputKeyComparator: by checksum and then decreasing pagerank
+
OutputValueGroupingComparator: by checksum
+
+
+
+
+
Reduce
+
+
In this phase the
+ {@link #reduce(Object, Iterator, OutputCollector, Reporter)}
+ method is called for each <key, (list of values)> pair in
+ the grouped inputs.
+
The output of the reduce task is typically written to the
+ {@link FileSystem} via
+ {@link OutputCollector#collect(Object, Object)}.
+
+
+
+
The output of the Reducer is not re-sorted.
+
+
Example:
+
+ public class MyReducer<K extends WritableComparable, V extends Writable>
+ extends MapReduceBase implements Reducer<K, V, K, V> {
+
+ static enum MyCounters { NUM_RECORDS }
+
+ private String reduceTaskId;
+ private int noKeys = 0;
+
+ public void configure(JobConf job) {
+ reduceTaskId = job.get("mapred.task.id");
+ }
+
+ public void reduce(K key, Iterator<V> values,
+ OutputCollector<K, V> output,
+ Reporter reporter)
+ throws IOException {
+
+ // Process
+ int noValues = 0;
+ while (values.hasNext()) {
+ V value = values.next();
+
+ // Increment the no. of values for this key
+ ++noValues;
+
+ // Process the <key, value> pair (assume this takes a while)
+ // ...
+ // ...
+
+ // Let the framework know that we are alive, and kicking!
+ if ((noValues%10) == 0) {
+ reporter.progress();
+ }
+
+ // Process some more
+ // ...
+ // ...
+
+ // Output the <key, value>
+ output.collect(key, value);
+ }
+
+ // Increment the no. of <key, list of values> pairs processed
+ ++noKeys;
+
+ // Increment counters
+ reporter.incrCounter(NUM_RECORDS, 1);
+
+ // Every 100 keys update application-level status
+ if ((noKeys%100) == 0) {
+ reporter.setStatus(reduceTaskId + " processed " + noKeys);
+ }
+ }
+ }
+
+
+ @see Mapper
+ @see Partitioner
+ @see Reporter
+ @see MapReduceBase
+ @deprecated Use {@link org.apache.hadoop.mapreduce.Reducer} instead.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Counter of the given group/name.]]>
+
+
+
+
+
+
+ Counter of the given group/name.]]>
+
+
+
+
+
+
+ Enum.
+ @param amount A non-negative amount by which the counter is to
+ be incremented.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ InputSplit that the map is reading from.
+ @throws UnsupportedOperationException if called outside a mapper]]>
+
+
+
+
+
+
+
+
+ {@link Mapper} and {@link Reducer} can use the Reporter
+ provided to report progress or just indicate that they are alive. In
+ scenarios where the application takes an insignificant amount of time to
+ process individual key/value pairs, this is crucial since the framework
+ might assume that the task has timed-out and kill that task.
+
+
Applications can also update {@link Counters} via the provided
+ Reporter .
+
+ @see Progressable
+ @see Counters]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ progress of the job's map-tasks, as a float between 0.0
+ and 1.0. When all map tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's map-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ progress of the job's reduce-tasks, as a float between 0.0
+ and 1.0. When all reduce tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's reduce-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ progress of the job's cleanup-tasks, as a float between 0.0
+ and 1.0. When all cleanup tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's cleanup-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ progress of the job's setup-tasks, as a float between 0.0
+ and 1.0. When all setup tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's setup-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ true if the job is complete, else false.
+ @throws IOException]]>
+
+
+
+
+
+ true if the job succeeded, else false.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ RunningJob is the user-interface to query for details on a
+ running Map-Reduce job.
+
+
Clients can get hold of RunningJob via the {@link JobClient}
+ and then query the running-job for details such as name, configuration,
+ progress etc.
+
+ @see JobClient]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This allows the user to specify the key class to be different
+ from the actual class ({@link BytesWritable}) used for writing
+
+ @param conf the {@link JobConf} to modify
+ @param theClass the SequenceFile output key class.]]>
+
+
+
+
+
+
+ This allows the user to specify the value class to be different
+ from the actual class ({@link BytesWritable}) used for writing
+
+ @param conf the {@link JobConf} to modify
+ @param theClass the SequenceFile output key class.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ f. The filtering criteria is
+ MD5(key) % f == 0.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ f using
+ the criteria record# % f == 0.
+ For example, if the frequency is 10, one out of 10 records is returned.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if auto increment
+ {@link SkipBadRecords#COUNTER_MAP_PROCESSED_RECORDS}.
+ false otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ true if auto increment
+ {@link SkipBadRecords#COUNTER_REDUCE_PROCESSED_GROUPS}.
+ false otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Hadoop provides an optional mode of execution in which the bad records
+ are detected and skipped in further attempts.
+
+
This feature can be used when map/reduce tasks crashes deterministically on
+ certain input. This happens due to bugs in the map/reduce function. The usual
+ course would be to fix these bugs. But sometimes this is not possible;
+ perhaps the bug is in third party libraries for which the source code is
+ not available. Due to this, the task never reaches to completion even with
+ multiple attempts and complete data for that task is lost.
+
+
With this feature, only a small portion of data is lost surrounding
+ the bad record, which may be acceptable for some user applications.
+ see {@link SkipBadRecords#setMapperMaxSkipRecords(Configuration, long)}
+
+
The skipping mode gets kicked off after certain no of failures
+ see {@link SkipBadRecords#setAttemptsToStartSkipping(Configuration, int)}
+
+
In the skipping mode, the map/reduce task maintains the record range which
+ is getting processed at all times. Before giving the input to the
+ map/reduce function, it sends this record range to the Task tracker.
+ If task crashes, the Task tracker knows which one was the last reported
+ range. On further attempts that range get skipped.
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ all task attempt IDs
+ of any jobtracker, in any job, of the first
+ map task, we would use :
+
+ @param jtIdentifier jobTracker identifier, or null
+ @param jobId job number, or null
+ @param isMap whether the tip is a map, or null
+ @param taskId taskId number, or null
+ @param attemptId the task attempt number, or null
+ @return a regex pattern matching TaskAttemptIDs]]>
+
+
+
+
+ An example TaskAttemptID is :
+ attempt_200707121733_0003_m_000005_0 , which represents the
+ zeroth task attempt for the fifth map task in the third job
+ running at the jobtracker started at 200707121733.
+
+ Applications should never construct or parse TaskAttemptID strings
+ , but rather use appropriate constructors or {@link #forName(String)}
+ method.
+
+ @see JobID
+ @see TaskID]]>
+
+ @param jtIdentifier jobTracker identifier, or null
+ @param jobId job number, or null
+ @param isMap whether the tip is a map, or null
+ @param taskId taskId number, or null
+ @return a regex pattern matching TaskIDs]]>
+
+
+
+
+
+
+
+
+ An example TaskID is :
+ task_200707121733_0003_m_000005 , which represents the
+ fifth map task in the third job running at the jobtracker
+ started at 200707121733.
+
+ Applications should never construct or parse TaskID strings
+ , but rather use appropriate constructors or {@link #forName(String)}
+ method.
+
+ @see JobID
+ @see TaskAttemptID]]>
+
+
+
+
+
+
+
+ (tbl(,),tbl(,),...,tbl(,)) }]]>
+
+
+
+
+
+
+
+ (tbl(,),tbl(,),...,tbl(,)) }]]>
+
+
+
+ mapred.join.define.<ident> to a classname. In the expression
+ mapred.join.expr, the identifier will be assumed to be a
+ ComposableRecordReader.
+ mapred.join.keycomparator can be a classname used to compare keys
+ in the join.
+ @see JoinRecordReader
+ @see MultiFilterRecordReader]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ......
+ }]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ capacity children to position
+ id in the parent reader.
+ The id of a root CompositeRecordReader is -1 by convention, but relying
+ on this is not recommended.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ override(S1,S2,S3) will prefer values
+ from S3 over S2, and values from S2 over S1 for all keys
+ emitted from all sources.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ [,,...,]]]>
+
+
+
+
+
+
+ out.
+ TupleWritable format:
+ {@code
+ ......
+ }]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ It has to be specified how key and values are passed from one element of
+ the chain to the next, by value or by reference. If a Mapper leverages the
+ assumed semantics that the key and values are not modified by the collector
+ 'by value' must be used. If the Mapper does not expect this semantics, as
+ an optimization to avoid serialization and deserialization 'by reference'
+ can be used.
+
+ For the added Mapper the configuration given for it,
+ mapperConf, have precedence over the job's JobConf. This
+ precedence is in effect when the task is running.
+
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainMapper, this is done by the addMapper for the last mapper in the chain
+
+
+ @param job job's JobConf to add the Mapper class.
+ @param klass the Mapper class to add.
+ @param inputKeyClass mapper input key class.
+ @param inputValueClass mapper input value class.
+ @param outputKeyClass mapper output key class.
+ @param outputValueClass mapper output value class.
+ @param byValue indicates if key/values should be passed by value
+ to the next Mapper in the chain, if any.
+ @param mapperConf a JobConf with the configuration for the Mapper
+ class. It is recommended to use a JobConf without default values using the
+ JobConf(boolean loadDefaults) constructor with FALSE.]]>
+
+
+
+
+
+
+ If this method is overriden super.configure(...) should be
+ invoked at the beginning of the overwriter method.]]>
+
+
+
+
+
+
+
+
+
+ map(...) methods of the Mappers in the chain.]]>
+
+
+
+
+
+
+ If this method is overriden super.close() should be
+ invoked at the end of the overwriter method.]]>
+
+
+
+
+ The Mapper classes are invoked in a chained (or piped) fashion, the output of
+ the first becomes the input of the second, and so on until the last Mapper,
+ the output of the last Mapper will be written to the task's output.
+
+ The key functionality of this feature is that the Mappers in the chain do not
+ need to be aware that they are executed in a chain. This enables having
+ reusable specialized Mappers that can be combined to perform composite
+ operations within a single task.
+
+ Special care has to be taken when creating chains that the key/values output
+ by a Mapper are valid for the following Mapper in the chain. It is assumed
+ all Mappers and the Reduce in the chain use maching output and input key and
+ value classes as no conversion is done by the chaining code.
+
+ Using the ChainMapper and the ChainReducer classes is possible to compose
+ Map/Reduce jobs that look like [MAP+ / REDUCE MAP*]. And
+ immediate benefit of this pattern is a dramatic reduction in disk IO.
+
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainMapper, this is done by the addMapper for the last mapper in the chain.
+
+ ChainMapper usage pattern:
+
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ It has to be specified how key and values are passed from one element of
+ the chain to the next, by value or by reference. If a Reducer leverages the
+ assumed semantics that the key and values are not modified by the collector
+ 'by value' must be used. If the Reducer does not expect this semantics, as
+ an optimization to avoid serialization and deserialization 'by reference'
+ can be used.
+
+ For the added Reducer the configuration given for it,
+ reducerConf, have precedence over the job's JobConf. This
+ precedence is in effect when the task is running.
+
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainReducer, this is done by the setReducer or the addMapper for the last
+ element in the chain.
+
+ @param job job's JobConf to add the Reducer class.
+ @param klass the Reducer class to add.
+ @param inputKeyClass reducer input key class.
+ @param inputValueClass reducer input value class.
+ @param outputKeyClass reducer output key class.
+ @param outputValueClass reducer output value class.
+ @param byValue indicates if key/values should be passed by value
+ to the next Mapper in the chain, if any.
+ @param reducerConf a JobConf with the configuration for the Reducer
+ class. It is recommended to use a JobConf without default values using the
+ JobConf(boolean loadDefaults) constructor with FALSE.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ It has to be specified how key and values are passed from one element of
+ the chain to the next, by value or by reference. If a Mapper leverages the
+ assumed semantics that the key and values are not modified by the collector
+ 'by value' must be used. If the Mapper does not expect this semantics, as
+ an optimization to avoid serialization and deserialization 'by reference'
+ can be used.
+
+ For the added Mapper the configuration given for it,
+ mapperConf, have precedence over the job's JobConf. This
+ precedence is in effect when the task is running.
+
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainMapper, this is done by the addMapper for the last mapper in the chain
+ .
+
+ @param job chain job's JobConf to add the Mapper class.
+ @param klass the Mapper class to add.
+ @param inputKeyClass mapper input key class.
+ @param inputValueClass mapper input value class.
+ @param outputKeyClass mapper output key class.
+ @param outputValueClass mapper output value class.
+ @param byValue indicates if key/values should be passed by value
+ to the next Mapper in the chain, if any.
+ @param mapperConf a JobConf with the configuration for the Mapper
+ class. It is recommended to use a JobConf without default values using the
+ JobConf(boolean loadDefaults) constructor with FALSE.]]>
+
+
+
+
+
+
+ If this method is overriden super.configure(...) should be
+ invoked at the beginning of the overwriter method.]]>
+
+
+
+
+
+
+
+
+
+ reduce(...) method of the Reducer with the
+ map(...) methods of the Mappers in the chain.]]>
+
+
+
+
+
+
+ If this method is overriden super.close() should be
+ invoked at the end of the overwriter method.]]>
+
+
+
+
+ For each record output by the Reducer, the Mapper classes are invoked in a
+ chained (or piped) fashion, the output of the first becomes the input of the
+ second, and so on until the last Mapper, the output of the last Mapper will
+ be written to the task's output.
+
+ The key functionality of this feature is that the Mappers in the chain do not
+ need to be aware that they are executed after the Reducer or in a chain.
+ This enables having reusable specialized Mappers that can be combined to
+ perform composite operations within a single task.
+
+ Special care has to be taken when creating chains that the key/values output
+ by a Mapper are valid for the following Mapper in the chain. It is assumed
+ all Mappers and the Reduce in the chain use maching output and input key and
+ value classes as no conversion is done by the chaining code.
+
+ Using the ChainMapper and the ChainReducer classes is possible to compose
+ Map/Reduce jobs that look like [MAP+ / REDUCE MAP*]. And
+ immediate benefit of this pattern is a dramatic reduction in disk IO.
+
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainReducer, this is done by the setReducer or the addMapper for the last
+ element in the chain.
+
+ ChainReducer usage pattern:
+
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ RecordReader's for CombineFileSplit's.
+ @see CombineFileSplit]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ th Path]]>
+
+
+
+
+
+ th Path]]>
+
+
+
+
+
+
+
+
+
+
+ th Path]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ CombineFileSplit can be used to implement {@link org.apache.hadoop.mapred.RecordReader}'s,
+ with reading one record per file.
+ @see org.apache.hadoop.mapred.FileSplit
+ @see CombineFileInputFormat]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ all splits.
+ @param freq The frequency with which records will be emitted.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ all splits.
+ This will read every split at the client, which is very expensive.
+ @param freq Probability with which a key will be chosen.
+ @param numSamples Total number of samples to obtain from all selected
+ splits.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ all splits.
+ Takes the first numSamples / numSplits records from each split.
+ @param numSamples Total number of samples to obtain from all selected
+ splits.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the name output is multi, false
+ if it is single. If the name output is not defined it returns
+ false]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ @param conf job conf to add the named output
+ @param namedOutput named output name, it has to be a word, letters
+ and numbers only, cannot be the word 'part' as
+ that is reserved for the
+ default output.
+ @param outputFormatClass OutputFormat class.
+ @param keyClass key class
+ @param valueClass value class]]>
+
+
+
+
+
+
+
+
+
+
+
+ @param conf job conf to add the named output
+ @param namedOutput named output name, it has to be a word, letters
+ and numbers only, cannot be the word 'part' as
+ that is reserved for the
+ default output.
+ @param outputFormatClass OutputFormat class.
+ @param keyClass key class
+ @param valueClass value class]]>
+
+
+
+
+
+
+
+ By default these counters are disabled.
+
+ MultipleOutputs supports counters, by default the are disabled.
+ The counters group is the {@link MultipleOutputs} class name.
+
+ The names of the counters are the same as the named outputs. For multi
+ named outputs the name of the counter is the concatenation of the named
+ output, and underscore '_' and the multiname.
+
+ @param conf job conf to enableadd the named output.
+ @param enabled indicates if the counters will be enabled or not.]]>
+
+
+
+
+
+
+ By default these counters are disabled.
+
+ MultipleOutputs supports counters, by default the are disabled.
+ The counters group is the {@link MultipleOutputs} class name.
+
+ The names of the counters are the same as the named outputs. For multi
+ named outputs the name of the counter is the concatenation of the named
+ output, and underscore '_' and the multiname.
+
+
+ @param conf job conf to enableadd the named output.
+ @return TRUE if the counters are enabled, FALSE if they are disabled.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ @param namedOutput the named output name
+ @param reporter the reporter
+ @return the output collector for the given named output
+ @throws IOException thrown if output collector could not be created]]>
+
+
+
+
+
+
+
+
+
+
+ @param namedOutput the named output name
+ @param multiName the multi name part
+ @param reporter the reporter
+ @return the output collector for the given named output
+ @throws IOException thrown if output collector could not be created]]>
+
+
+
+
+
+
+ If overriden subclasses must invoke super.close() at the
+ end of their close()
+
+ @throws java.io.IOException thrown if any of the MultipleOutput files
+ could not be closed properly.]]>
+
+
+
+ OutputCollector passed to
+ the map() and reduce() methods of the
+ Mapper and Reducer implementations.
+
+ Each additional output, or named output, may be configured with its own
+ OutputFormat, with its own key class and with its own value
+ class.
+
+ A named output can be a single file or a multi file. The later is refered as
+ a multi named output.
+
+ A multi named output is an unbound set of files all sharing the same
+ OutputFormat, key class and value class configuration.
+
+ When named outputs are used within a Mapper implementation,
+ key/values written to a name output are not part of the reduce phase, only
+ key/values written to the job OutputCollector are part of the
+ reduce phase.
+
+ MultipleOutputs supports counters, by default the are disabled. The counters
+ group is the {@link MultipleOutputs} class name.
+
+ The names of the counters are the same as the named outputs. For multi
+ named outputs the name of the counter is the concatenation of the named
+ output, and underscore '_' and the multiname.
+
+ Job configuration usage pattern is:
+
+
+ JobConf conf = new JobConf();
+
+ conf.setInputPath(inDir);
+ FileOutputFormat.setOutputPath(conf, outDir);
+
+ conf.setMapperClass(MOMap.class);
+ conf.setReducerClass(MOReduce.class);
+ ...
+
+ // Defines additional single text based output 'text' for the job
+ MultipleOutputs.addNamedOutput(conf, "text", TextOutputFormat.class,
+ LongWritable.class, Text.class);
+
+ // Defines additional multi sequencefile based output 'sequence' for the
+ // job
+ MultipleOutputs.addMultiNamedOutput(conf, "seq",
+ SequenceFileOutputFormat.class,
+ LongWritable.class, Text.class);
+ ...
+
+ JobClient jc = new JobClient();
+ RunningJob job = jc.submitJob(conf);
+
+ ...
+
+
+ Job configuration usage pattern is:
+
+
+ public class MOReduce implements
+ Reducer<WritableComparable, Writable> {
+ private MultipleOutputs mos;
+
+ public void configure(JobConf conf) {
+ ...
+ mos = new MultipleOutputs(conf);
+ }
+
+ public void reduce(WritableComparable key, Iterator<Writable> values,
+ OutputCollector output, Reporter reporter)
+ throws IOException {
+ ...
+ mos.getCollector("text", reporter).collect(key, new Text("Hello"));
+ mos.getCollector("seq", "A", reporter).collect(key, new Text("Bye"));
+ mos.getCollector("seq", "B", reporter).collect(key, new Text("Chau"));
+ ...
+ }
+
+ public void close() throws IOException {
+ mos.close();
+ ...
+ }
+
+ }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ It can be used instead of the default implementation,
+ @link org.apache.hadoop.mapred.MapRunner, when the Map operation is not CPU
+ bound in order to improve throughput.
+
+ Map implementations using this MapRunnable must be thread-safe.
+
+ The Map-Reduce job has to be configured to use this MapRunnable class (using
+ the JobConf.setMapRunnerClass method) and
+ the number of thread the thread-pool can use with the
+ mapred.map.multithreadedrunner.threads property, its default
+ value is 10 threads.
+
+ Alternatively, the properties can be set in the configuration with proper
+ values.
+
+ @see DBConfiguration#configureDB(JobConf, String, String, String, String)
+ @see DBInputFormat#setInput(JobConf, Class, String, String)
+ @see DBInputFormat#setInput(JobConf, Class, String, String, String, String...)
+ @see DBOutputFormat#setOutput(JobConf, String, String...)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ 20070101 AND length > 0)'
+ @param orderBy the fieldNames in the orderBy clause.
+ @param fieldNames The field names in the table
+ @see #setInput(JobConf, Class, String, String)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ DBInputFormat emits LongWritables containing the record number as
+ key and DBWritables as value.
+
+ The SQL query, and input class can be using one of the two
+ setInput methods.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ {@link DBOutputFormat} accepts <key,value> pairs, where
+ key has a type extending DBWritable. Returned {@link RecordWriter}
+ writes only the key to the database with a batch SQL query.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ DBWritable. DBWritable, is similar to {@link Writable}
+ except that the {@link #write(PreparedStatement)} method takes a
+ {@link PreparedStatement}, and {@link #readFields(ResultSet)}
+ takes a {@link ResultSet}.
+
+ Implementations are responsible for writing the fields of the object
+ to PreparedStatement, and reading the fields of the object from the
+ ResultSet.
+
+
Example:
+ If we have the following table in the database :
+
+ CREATE TABLE MyTable (
+ counter INTEGER NOT NULL,
+ timestamp BIGINT NOT NULL,
+ );
+
+ then we can read/write the tuples from/to the table with :
+
+ public class MyWritable implements Writable, DBWritable {
+ // Some data
+ private int counter;
+ private long timestamp;
+
+ //Writable#write() implementation
+ public void write(DataOutput out) throws IOException {
+ out.writeInt(counter);
+ out.writeLong(timestamp);
+ }
+
+ //Writable#readFields() implementation
+ public void readFields(DataInput in) throws IOException {
+ counter = in.readInt();
+ timestamp = in.readLong();
+ }
+
+ public void write(PreparedStatement statement) throws SQLException {
+ statement.setInt(1, counter);
+ statement.setLong(2, timestamp);
+ }
+
+ public void readFields(ResultSet resultSet) throws SQLException {
+ counter = resultSet.getInt(1);
+ timestamp = resultSet.getLong(2);
+ }
+ }
+
Note: The split is a logical split of the inputs and the
+ input files are not physically split into chunks. For e.g. a split could
+ be <input-file-path, start, offset> tuple. The InputFormat
+ also creates the {@link RecordReader} to read the {@link InputSplit}.
+
+ @param context job configuration.
+ @return an array of {@link InputSplit}s for the job.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ InputFormat describes the input-specification for a
+ Map-Reduce job.
+
+
The Map-Reduce framework relies on the InputFormat of the
+ job to:
+
+
+ Validate the input-specification of the job.
+
+ Split-up the input file(s) into logical {@link InputSplit}s, each of
+ which is then assigned to an individual {@link Mapper}.
+
+
+ Provide the {@link RecordReader} implementation to be used to glean
+ input records from the logical InputSplit for processing by
+ the {@link Mapper}.
+
+
+
+
The default behavior of file-based {@link InputFormat}s, typically
+ sub-classes of {@link FileInputFormat}, is to split the
+ input into logical {@link InputSplit}s based on the total size, in
+ bytes, of the input files. However, the {@link FileSystem} blocksize of
+ the input files is treated as an upper bound for input splits. A lower bound
+ on the split size can be set via
+
+ mapred.min.split.size.
+
+
Clearly, logical splits based on input-size is insufficient for many
+ applications since record boundaries are to respected. In such cases, the
+ application has to also implement a {@link RecordReader} on whom lies the
+ responsibility to respect record-boundaries and present a record-oriented
+ view of the logical InputSplit to the individual task.
+
+ @see InputSplit
+ @see RecordReader
+ @see FileInputFormat]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ InputSplit represents the data to be processed by an
+ individual {@link Mapper}.
+
+
Typically, it presents a byte-oriented view on the input and is the
+ responsibility of {@link RecordReader} of the job to process this and present
+ a record-oriented view.
+
+ @see InputFormat
+ @see RecordReader]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ InputFormat to use
+ @throws IllegalStateException if the job is submitted]]>
+
+
+
+
+
+
+ OutputFormat to use
+ @throws IllegalStateException if the job is submitted]]>
+
+
+
+
+
+
+ Mapper to use
+ @throws IllegalStateException if the job is submitted]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Reducer to use
+ @throws IllegalStateException if the job is submitted]]>
+
+
+
+
+
+
+ Partitioner to use
+ @throws IllegalStateException if the job is submitted]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ progress of the job's map-tasks, as a float between 0.0
+ and 1.0. When all map tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's map-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ progress of the job's reduce-tasks, as a float between 0.0
+ and 1.0. When all reduce tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's reduce-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ true if the job is complete, else false.
+ @throws IOException]]>
+
+
+
+
+
+ true if the job succeeded, else false.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ JobTracker is lost]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ 1.
+ @return the number of reduce tasks for this job.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ An example JobID is :
+ job_200707121733_0003 , which represents the third job
+ running at the jobtracker started at 200707121733.
+
+ Applications should never construct or parse JobID strings, but rather
+ use appropriate constructors or {@link #forName(String)} method.
+
+ @see TaskID
+ @see TaskAttemptID
+ @see org.apache.hadoop.mapred.JobTracker#getNewJobId()
+ @see org.apache.hadoop.mapred.JobTracker#getStartTime()]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the key input type to the Mapper
+ @param the value input type to the Mapper
+ @param the key output type from the Mapper
+ @param the value output type from the Mapper]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Maps are the individual tasks which transform input records into a
+ intermediate records. The transformed intermediate records need not be of
+ the same type as the input records. A given input pair may map to zero or
+ many output pairs.
+
+
The Hadoop Map-Reduce framework spawns one map task for each
+ {@link InputSplit} generated by the {@link InputFormat} for the job.
+ Mapper implementations can access the {@link Configuration} for
+ the job via the {@link JobContext#getConfiguration()}.
+
+
The framework first calls
+ {@link #setup(org.apache.hadoop.mapreduce.Mapper.Context)}, followed by
+ {@link #map(Object, Object, Context)}
+ for each key/value pair in the InputSplit. Finally
+ {@link #cleanup(Context)} is called.
+
+
All intermediate values associated with a given output key are
+ subsequently grouped by the framework, and passed to a {@link Reducer} to
+ determine the final output. Users can control the sorting and grouping by
+ specifying two key {@link RawComparator} classes.
+
+
The Mapper outputs are partitioned per
+ Reducer. Users can control which keys (and hence records) go to
+ which Reducer by implementing a custom {@link Partitioner}.
+
+
Users can optionally specify a combiner, via
+ {@link Job#setCombinerClass(Class)}, to perform local aggregation of the
+ intermediate outputs, which helps to cut down the amount of data transferred
+ from the Mapper to the Reducer.
+
+
Applications can specify if and how the intermediate
+ outputs are to be compressed and which {@link CompressionCodec}s are to be
+ used via the Configuration.
+
+
If the job has zero
+ reduces then the output of the Mapper is directly written
+ to the {@link OutputFormat} without sorting by keys.
+
+
Example:
+
+ public class TokenCounterMapper
+ extends Mapper
+
+
Applications may override the {@link #run(Context)} method to exert
+ greater control on map processing e.g. multi-threaded Mappers
+ etc.
The Map-Reduce framework relies on the OutputCommitter of
+ the job to:
+
+
+ Setup the job during initialization. For example, create the temporary
+ output directory for the job during the initialization of the job.
+
+
+ Cleanup the job after the job completion. For example, remove the
+ temporary output directory after the job completion.
+
+
+ Setup the task temporary output.
+
+
+ Check whether a task needs a commit. This is to avoid the commit
+ procedure if a task does not need commit.
+
+
+ Commit of the task output.
+
+
+ Discard the task commit.
+
+
+
+ @see org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
+ @see JobContext
+ @see TaskAttemptContext]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This is to validate the output specification for the job when it is
+ a job is submitted. Typically checks that it does not already exist,
+ throwing an exception when it already exists, so that output is not
+ overwritten.
+
+ @param context information about the job
+ @throws IOException when output should not be attempted]]>
+
+
+
+
+
+
+
+
+
+
+
+ OutputFormat describes the output-specification for a
+ Map-Reduce job.
+
+
The Map-Reduce framework relies on the OutputFormat of the
+ job to:
+
+
+ Validate the output-specification of the job. For e.g. check that the
+ output directory doesn't already exist.
+
+ Provide the {@link RecordWriter} implementation to be used to write out
+ the output files of the job. Output files are stored in a
+ {@link FileSystem}.
+
+
+
+ @see RecordWriter]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ Typically a hash function on a all or a subset of the key.
+
+ @param key the key to be partioned.
+ @param value the entry value.
+ @param numPartitions the total number of partitions.
+ @return the partition number for the key.]]>
+
+
+
+ Partitioner controls the partitioning of the keys of the
+ intermediate map-outputs. The key (or a subset of the key) is used to derive
+ the partition, typically by a hash function. The total number of partitions
+ is the same as the number of reduce tasks for the job. Hence this controls
+ which of the m reduce tasks the intermediate key (and hence the
+ record) is sent for reduction.
+
+ @see Reducer]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ @param ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ RecordWriter to future operations.
+
+ @param context the context of the task
+ @throws IOException]]>
+
+
+
+ RecordWriter writes the output <key, value> pairs
+ to an output file.
+
+
RecordWriter implementations write the job outputs to the
+ {@link FileSystem}.
+
+ @see OutputFormat]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the class of the input keys
+ @param the class of the input values
+ @param the class of the output keys
+ @param the class of the output values]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Reducer implementations
+ can access the {@link Configuration} for the job via the
+ {@link JobContext#getConfiguration()} method.
+
+
Reducer has 3 primary phases:
+
+
+
+
Shuffle
+
+
The Reducer copies the sorted output from each
+ {@link Mapper} using HTTP across the network.
+
+
+
+
Sort
+
+
The framework merge sorts Reducer inputs by
+ keys
+ (since different Mappers may have output the same key).
+
+
The shuffle and sort phases occur simultaneously i.e. while outputs are
+ being fetched they are merged.
+
+
SecondarySort
+
+
To achieve a secondary sort on the values returned by the value
+ iterator, the application should extend the key with the secondary
+ key and define a grouping comparator. The keys will be sorted using the
+ entire key, but will be grouped using the grouping comparator to decide
+ which keys and values are sent in the same call to reduce.The grouping
+ comparator is specified via
+ {@link Job#setGroupingComparatorClass(Class)}. The sort order is
+ controlled by
+ {@link Job#setSortComparatorClass(Class)}.
+
+
+ For example, say that you want to find duplicate web pages and tag them
+ all with the url of the "best" known example. You would set up the job
+ like:
+
+
Map Input Key: url
+
Map Input Value: document
+
Map Output Key: document checksum, url pagerank
+
Map Output Value: url
+
Partitioner: by checksum
+
OutputKeyComparator: by checksum and then decreasing pagerank
+
OutputValueGroupingComparator: by checksum
+
+
+
+
+
Reduce
+
+
In this phase the
+ {@link #reduce(Object, Iterable, Context)}
+ method is called for each <key, (collection of values)> in
+ the sorted inputs.
+
The output of the reduce task is typically written to a
+ {@link RecordWriter} via
+ {@link Context#write(Object, Object)}.
+
+
+
+
The output of the Reducer is not re-sorted.
+
+
Example:
+
+ public class IntSumReducer extends Reducer {
+ private IntWritable result = new IntWritable();
+
+ public void reduce(Key key, Iterable values,
+ Context context) throws IOException {
+ int sum = 0;
+ for (IntWritable val : values) {
+ sum += val.get();
+ }
+ result.set(sum);
+ context.collect(key, result);
+ }
+ }
+
+ Applications should never construct or parse TaskAttemptID strings
+ , but rather use appropriate constructors or {@link #forName(String)}
+ method.
+
+ @see JobID
+ @see TaskID]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ An example TaskID is :
+ task_200707121733_0003_m_000005 , which represents the
+ fifth map task in the third job running at the jobtracker
+ started at 200707121733.
+
+ Applications should never construct or parse TaskID strings
+ , but rather use appropriate constructors or {@link #forName(String)}
+ method.
+
+ @see JobID
+ @see TaskAttemptID]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the input key type for the task
+ @param the input value type for the task
+ @param the output key type for the task
+ @param the output value type for the task]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FileInputFormat implementations can override this and return
+ false to ensure that individual input files are never split-up
+ so that {@link Mapper}s process entire files.
+
+ @param context the job context
+ @param filename the file name to check
+ @return is this file splitable?]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FileInputFormat is the base class for all file-based
+ InputFormats. This provides a generic implementation of
+ {@link #getSplits(JobContext)}.
+ Subclasses of FileInputFormat can also override the
+ {@link #isSplitable(JobContext, Path)} method to ensure input-files are
+ not split-up and are processed as a whole by {@link Mapper}s.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the map's input key type
+ @param the map's input value type
+ @param the map's output key type
+ @param the map's output value type
+ @param job the job
+ @return the mapper class to run]]>
+
+
+
+
+
+
+ the map input key type
+ @param the map input value type
+ @param the map output key type
+ @param the map output value type
+ @param job the job to modify
+ @param cls the class to use as the mapper]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ It can be used instead of the default implementation,
+ @link org.apache.hadoop.mapred.MapRunner, when the Map operation is not CPU
+ bound in order to improve throughput.
+
+ Mapper implementations using this MapRunnable must be thread-safe.
+
+ The Map-Reduce job has to be configured with the mapper to use via
+ {@link #setMapperClass(Configuration, Class)} and
+ the number of thread the thread-pool can use with the
+ {@link #getNumberOfThreads(Configuration) method. The default
+ value is 10 threads.
+
Some applications need to create/write-to side-files, which differ from
+ the actual job-outputs.
+
+
In such cases there could be issues with 2 instances of the same TIP
+ (running simultaneously e.g. speculative tasks) trying to open/write-to the
+ same file (path) on HDFS. Hence the application-writer will have to pick
+ unique names per task-attempt (e.g. using the attemptid, say
+ attempt_200709221812_0001_m_000000_0), not just per TIP.
+
+
To get around this the Map-Reduce framework helps the application-writer
+ out by maintaining a special
+ ${mapred.output.dir}/_temporary/_${taskid}
+ sub-directory for each task-attempt on HDFS where the output of the
+ task-attempt goes. On successful completion of the task-attempt the files
+ in the ${mapred.output.dir}/_temporary/_${taskid} (only)
+ are promoted to ${mapred.output.dir}. Of course, the
+ framework discards the sub-directory of unsuccessful task-attempts. This
+ is completely transparent to the application.
+
+
The application-writer can take advantage of this by creating any
+ side-files required in a work directory during execution
+ of his task i.e. via
+ {@link #getWorkOutputPath(TaskInputOutputContext)}, and
+ the framework will move them out similarly - thus she doesn't have to pick
+ unique paths per task-attempt.
+
+
The entire discussion holds true for maps of jobs with
+ reducer=NONE (i.e. 0 reduces) since output of the map, in that case,
+ goes directly to HDFS.
+
+ @return the {@link Path} to the task's temporary output directory
+ for the map-reduce job.]]>
+
+
+
+
+
+
+
+
+
+ The path can be used to create custom files from within the map and
+ reduce tasks. The path name will be unique for each task. The path parent
+ will be the job output directory.ls
+
+
This method uses the {@link #getUniqueFile} method to make the file name
+ unique for the task.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/x86_64/share/hadoop/common/jdiff/hadoop_0.20.2.xml b/x86_64/share/hadoop/common/jdiff/hadoop_0.20.2.xml
new file mode 100644
index 0000000..36a2a4b
--- /dev/null
+++ b/x86_64/share/hadoop/common/jdiff/hadoop_0.20.2.xml
@@ -0,0 +1,53959 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ final.
+
+ @param name resource to be added, the classpath is examined for a file
+ with that name.]]>
+
+
+
+
+
+ final.
+
+ @param url url of the resource to be added, the local filesystem is
+ examined directly to find the resource, without referring to
+ the classpath.]]>
+
+
+
+
+
+ final.
+
+ @param file file-path of resource to be added, the local filesystem is
+ examined directly to find the resource, without referring to
+ the classpath.]]>
+
+
+
+
+
+ final.
+
+ @param in InputStream to deserialize the object from.]]>
+
+
+
+
+
+
+
+
+
+
+ name property, null if
+ no such property exists.
+
+ Values are processed for variable expansion
+ before being returned.
+
+ @param name the property name.
+ @return the value of the name property,
+ or null if no such property exists.]]>
+
+
+
+
+
+ name property, without doing
+ variable expansion.
+
+ @param name the property name.
+ @return the value of the name property,
+ or null if no such property exists.]]>
+
+
+
+
+
+
+ value of the name property.
+
+ @param name property name.
+ @param value property value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ name property. If no such property
+ exists, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value, or defaultValue if the property
+ doesn't exist.]]>
+
+
+
+
+
+
+ name property as an int.
+
+ If no such property exists, or if the specified value is not a valid
+ int, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as an int,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to an int.
+
+ @param name property name.
+ @param value int value of the property.]]>
+
+
+
+
+
+
+ name property as a long.
+ If no such property is specified, or if the specified value is not a valid
+ long, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as a long,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to a long.
+
+ @param name property name.
+ @param value long value of the property.]]>
+
+
+
+
+
+
+ name property as a float.
+ If no such property is specified, or if the specified value is not a valid
+ float, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as a float,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to a float.
+
+ @param name property name.
+ @param value property value.]]>
+
+
+
+
+
+
+ name property as a boolean.
+ If no such property is specified, or if the specified value is not a valid
+ boolean, then defaultValue is returned.
+
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as a boolean,
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property to a boolean.
+
+ @param name property name.
+ @param value boolean value of the property.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ name property as
+ a collection of Strings.
+ If no such property is specified then empty collection is returned.
+
+ This is an optimized version of {@link #getStrings(String)}
+
+ @param name property name.
+ @return property value as a collection of Strings.]]>
+
+
+
+
+
+ name property as
+ an array of Strings.
+ If no such property is specified then null is returned.
+
+ @param name property name.
+ @return property value as an array of Strings,
+ or null.]]>
+
+
+
+
+
+
+ name property as
+ an array of Strings.
+ If no such property is specified then default value is returned.
+
+ @param name property name.
+ @param defaultValue The default value
+ @return property value as an array of Strings,
+ or default value.]]>
+
+
+
+
+
+
+ name property as
+ as comma delimited values.
+
+ @param name property name.
+ @param values The values]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ name property
+ as an array of Class.
+ The value of the property specifies a list of comma separated class names.
+ If no such property is specified, then defaultValue is
+ returned.
+
+ @param name the property name.
+ @param defaultValue default value.
+ @return property value as a Class[],
+ or defaultValue.]]>
+
+
+
+
+
+
+ name property as a Class.
+ If no such property is specified, then defaultValue is
+ returned.
+
+ @param name the class name.
+ @param defaultValue default value.
+ @return property value as a Class,
+ or defaultValue.]]>
+
+
+
+
+
+
+
+ name property as a Class
+ implementing the interface specified by xface.
+
+ If no such property is specified, then defaultValue is
+ returned.
+
+ An exception is thrown if the returned class does not implement the named
+ interface.
+
+ @param name the class name.
+ @param defaultValue default value.
+ @param xface the interface implemented by the named class.
+ @return property value as a Class,
+ or defaultValue.]]>
+
+
+
+
+
+
+
+ name property to the name of a
+ theClass implementing the given interface xface.
+
+ An exception is thrown if theClass does not implement the
+ interface xface.
+
+ @param name property name.
+ @param theClass property value.
+ @param xface the interface implemented by the named class.]]>
+
+
+
+
+
+
+
+ dirsProp with
+ the given path. If dirsProp contains multiple directories,
+ then one is chosen based on path's hash code. If the selected
+ directory does not exist, an attempt is made to create it.
+
+ @param dirsProp directory in which to locate the file.
+ @param path file-path.
+ @return local file under the directory with the given path.]]>
+
+
+
+
+
+
+
+ dirsProp with
+ the given path. If dirsProp contains multiple directories,
+ then one is chosen based on path's hash code. If the selected
+ directory does not exist, an attempt is made to create it.
+
+ @param dirsProp directory in which to locate the file.
+ @param path file-path.
+ @return local file under the directory with the given path.]]>
+
+
+
+
+
+
+
+
+
+
+
+ name.
+
+ @param name configuration resource name.
+ @return an input stream attached to the resource.]]>
+
+
+
+
+
+ name.
+
+ @param name configuration resource name.
+ @return a reader attached to the resource.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ String
+ key-value pairs in the configuration.
+
+ @return an iterator over the entries.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true to set quiet-mode on, false
+ to turn it off.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Resources
+
+
Configurations are specified by resources. A resource contains a set of
+ name/value pairs as XML data. Each resource is named by either a
+ String or by a {@link Path}. If named by a String,
+ then the classpath is examined for a file with that name. If named by a
+ Path, then the local filesystem is examined directly, without
+ referring to the classpath.
+
+
Unless explicitly turned off, Hadoop by default specifies two
+ resources, loaded in-order from the classpath:
core-site.xml: Site-specific configuration for a given hadoop
+ installation.
+
+ Applications may add additional resources, which are loaded
+ subsequent to these resources in the order they are added.
+
+
Final Parameters
+
+
Configuration parameters may be declared final.
+ Once a resource declares a value final, no subsequently-loaded
+ resource can alter that value.
+ For example, one might define a final parameter with:
+
+
+ When conf.get("tempdir") is called, then ${basedir}
+ will be resolved to another property in this Configuration, while
+ ${user.name} would then ordinarily be resolved to the value
+ of the System property with that name.]]>
+
Applications specify the files, via urls (hdfs:// or http://) to be cached
+ via the {@link org.apache.hadoop.mapred.JobConf}.
+ The DistributedCache assumes that the
+ files specified via hdfs:// urls are already present on the
+ {@link FileSystem} at the path specified by the url.
+
+
The framework will copy the necessary files on to the slave node before
+ any tasks for the job are executed on that node. Its efficiency stems from
+ the fact that the files are only copied once per job and the ability to
+ cache archives which are un-archived on the slaves.
+
+
DistributedCache can be used to distribute simple, read-only
+ data/text files and/or more complex types such as archives, jars etc.
+ Archives (zip, tar and tgz/tar.gz files) are un-archived at the slave nodes.
+ Jars may be optionally added to the classpath of the tasks, a rudimentary
+ software distribution mechanism. Files have execution permissions.
+ Optionally users can also direct it to symlink the distributed cache file(s)
+ into the working directory of the task.
+
+
DistributedCache tracks modification timestamps of the cache
+ files. Clearly the cache files should not be modified by the application
+ or externally while the job is executing.
+
+
Here is an illustrative example on how to use the
+ DistributedCache:
+
+ // Setting up the cache for the application
+
+ 1. Copy the requisite files to the FileSystem:
+
+ $ bin/hadoop fs -copyFromLocal lookup.dat /myapp/lookup.dat
+ $ bin/hadoop fs -copyFromLocal map.zip /myapp/map.zip
+ $ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar
+ $ bin/hadoop fs -copyFromLocal mytar.tar /myapp/mytar.tar
+ $ bin/hadoop fs -copyFromLocal mytgz.tgz /myapp/mytgz.tgz
+ $ bin/hadoop fs -copyFromLocal mytargz.tar.gz /myapp/mytargz.tar.gz
+
+ 2. Setup the application's JobConf:
+
+ JobConf job = new JobConf();
+ DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"),
+ job);
+ DistributedCache.addCacheArchive(new URI("/myapp/map.zip", job);
+ DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
+ DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar", job);
+ DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz", job);
+ DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz", job);
+
+ 3. Use the cached files in the {@link org.apache.hadoop.mapred.Mapper}
+ or {@link org.apache.hadoop.mapred.Reducer}:
+
+ public static class MapClass extends MapReduceBase
+ implements Mapper<K, V, K, V> {
+
+ private Path[] localArchives;
+ private Path[] localFiles;
+
+ public void configure(JobConf job) {
+ // Get the cached archives/files
+ localArchives = DistributedCache.getLocalCacheArchives(job);
+ localFiles = DistributedCache.getLocalCacheFiles(job);
+ }
+
+ public void map(K key, V value,
+ OutputCollector<K, V> output, Reporter reporter)
+ throws IOException {
+ // Use data from the cached archives/files here
+ // ...
+ // ...
+ output.collect(k, v);
+ }
+ }
+
+
+ A filename pattern is composed of regular characters and
+ special pattern matching characters, which are:
+
+
+
+
+
+
?
+
Matches any single character.
+
+
+
*
+
Matches zero or more characters.
+
+
+
[abc]
+
Matches a single character from character set
+ {a,b,c}.
+
+
+
[a-b]
+
Matches a single character from the character range
+ {a...b}. Note that character a must be
+ lexicographically less than or equal to character b.
+
+
+
[^a]
+
Matches a single character that is not from character set or range
+ {a}. Note that the ^ character must occur
+ immediately to the right of the opening bracket.
+
+
+
\c
+
Removes (escapes) any special meaning of character c.
+
+
+
{ab,cd}
+
Matches a string from the string set {ab, cd}
+
+
+
{ab,c{de,fh}}
+
Matches a string from the string set {ab, cde, cfh}
+
+
+
+
+
+ @param pathPattern a regular expression specifying a pth pattern
+
+ @return an array of paths that match the path pattern
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ All user code that may potentially use the Hadoop Distributed
+ File System should be written to use a FileSystem object. The
+ Hadoop DFS is a multi-machine system that appears as a single
+ disk. It's useful because of its fault tolerance and potentially
+ very large capacity.
+
+
+ The local implementation is {@link LocalFileSystem} and distributed
+ implementation is DistributedFileSystem.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FilterFileSystem contains
+ some other file system, which it uses as
+ its basic file system, possibly transforming
+ the data along the way or providing additional
+ functionality. The class FilterFileSystem
+ itself simply overrides all methods of
+ FileSystem with versions that
+ pass all requests to the contained file
+ system. Subclasses of FilterFileSystem
+ may further override some of these methods
+ and may also provide additional methods
+ and fields.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ buf at offset
+ and checksum into checksum.
+ The method is used for implementing read, therefore, it should be optimized
+ for sequential reading
+ @param pos chunkPos
+ @param buf desitination buffer
+ @param offset offset in buf at which to store data
+ @param len maximun number of bytes to read
+ @return number of bytes read]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ -1 if the end of the
+ stream is reached.
+ @exception IOException if an I/O error occurs.]]>
+
+
+
+
+
+
+
+
+ This method implements the general contract of the corresponding
+ {@link InputStream#read(byte[], int, int) read} method of
+ the {@link InputStream} class. As an additional
+ convenience, it attempts to read as many bytes as possible by repeatedly
+ invoking the read method of the underlying stream. This
+ iterated read continues until one of the following
+ conditions becomes true:
+
+
The specified number of bytes have been read,
+
+
The read method of the underlying stream returns
+ -1, indicating end-of-file.
+
+
If the first read on the underlying stream returns
+ -1 to indicate end-of-file then this method returns
+ -1. Otherwise this method returns the number of bytes
+ actually read.
+
+ @param b destination buffer.
+ @param off offset at which to start storing bytes.
+ @param len maximum number of bytes to read.
+ @return the number of bytes read, or -1 if the end of
+ the stream has been reached.
+ @exception IOException if an I/O error occurs.
+ ChecksumException if any checksum error occurs]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ n bytes of data from the
+ input stream.
+
+
This method may skip more bytes than are remaining in the backing
+ file. This produces no exception and the number of bytes skipped
+ may include some number of bytes that were beyond the EOF of the
+ backing file. Attempting to read from the stream after skipping past
+ the end will result in -1 indicating the end of the file.
+
+
If n is negative, no bytes are skipped.
+
+ @param n the number of bytes to be skipped.
+ @return the actual number of bytes skipped.
+ @exception IOException if an I/O error occurs.
+ ChecksumException if the chunk to skip to is corrupted]]>
+
+
+
+
+
+
+ This method may seek past the end of the file.
+ This produces no exception and an attempt to read from
+ the stream will result in -1 indicating the end of the file.
+
+ @param pos the postion to seek to.
+ @exception IOException if an I/O error occurs.
+ ChecksumException if the chunk to seek to is corrupted]]>
+
+
+
+
+
+
+
+
+
+ len bytes from
+ stm
+
+ @param stm an input stream
+ @param buf destiniation buffer
+ @param offset offset at which to store data
+ @param len number of bytes to read
+ @return actual number of bytes read
+ @throws IOException if there is any IO error]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ len bytes from the specified byte array
+ starting at offset off and generate a checksum for
+ each data chunk.
+
+
This method stores bytes from the given array into this
+ stream's buffer before it gets checksumed. The buffer gets checksumed
+ and flushed to the underlying output stream when all data
+ in a checksum chunk are in the buffer. If the buffer is empty and
+ requested length is at least as large as the size of next checksum chunk
+ size, this method will checksum and write the chunk directly
+ to the underlying output stream. Thus it avoids uneccessary data copy.
+
+ @param b the data.
+ @param off the start offset in the data.
+ @param len the number of bytes to write.
+ @exception IOException if an I/O error occurs.]]>
+
+
+ DataInputBuffer buffer = new DataInputBuffer();
+ while (... loop condition ...) {
+ byte[] data = ... get data ...;
+ int dataLength = ... get data length ...;
+ buffer.reset(data, dataLength);
+ ... read buffer using DataInput methods ...
+ }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This saves memory over creating a new DataOutputStream and
+ ByteArrayOutputStream each time data is written.
+
+
Typical usage is something like the following:
+
+ DataOutputBuffer buffer = new DataOutputBuffer();
+ while (... loop condition ...) {
+ buffer.reset();
+ ... write buffer using DataOutput methods ...
+ byte[] data = buffer.getData();
+ int dataLength = buffer.getLength();
+ ... write data to its ultimate destination ...
+ }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to store
+ @param item the object to be stored
+ @param keyName the name of the key to use
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param keyName the name of the key to use
+ @param itemClass the class of the item
+ @return restored object
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param items the objects to be stored
+ @param keyName the name of the key to use
+ @throws IndexOutOfBoundsException if the items array is empty
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+
+
+
+
+ the class of the item
+ @param conf the configuration to use
+ @param keyName the name of the key to use
+ @param itemClass the class of the item
+ @return restored object
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+
+
+
+
+ DefaultStringifier offers convenience methods to store/load objects to/from
+ the configuration.
+
+ @param the class of the objects to stringify]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a DoubleWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a FloatWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ When two sequence files, which have same Key type but different Value
+ types, are mapped out to reduce, multiple Value types is not allowed.
+ In this case, this class can help you wrap instances with different types.
+
+
+
+ Compared with ObjectWritable, this class is much more effective,
+ because ObjectWritable will append the class declaration as a String
+ into the output file in every Key-Value pair.
+
+
+
+ Generic Writable implements {@link Configurable} interface, so that it will be
+ configured by the framework. The configuration is passed to the wrapped objects
+ implementing {@link Configurable} interface before deserialization.
+
+
+ how to use it:
+ 1. Write your own class, such as GenericObject, which extends GenericWritable.
+ 2. Implements the abstract method getTypes(), defines
+ the classes which will be wrapped in GenericObject in application.
+ Attention: this classes defined in getTypes() method, must
+ implement Writable interface.
+
+
+ @since Nov 8, 2006]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This saves memory over creating a new InputStream and
+ ByteArrayInputStream each time data is read.
+
+
Typical usage is something like the following:
+
+ InputBuffer buffer = new InputBuffer();
+ while (... loop condition ...) {
+ byte[] data = ... get data ...;
+ int dataLength = ... get data length ...;
+ buffer.reset(data, dataLength);
+ ... read buffer using InputStream methods ...
+ }
+
+ @see DataInputBuffer
+ @see DataOutput]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a IntWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ closes the input and output streams
+ at the end.
+ @param in InputStrem to read from
+ @param out OutputStream to write to
+ @param conf the Configuration object]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ignore any {@link IOException} or
+ null pointers. Must only be used for cleanup in exception handlers.
+ @param log the log to record problems to at debug level. Can be null.
+ @param closeables the objects to close]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a LongWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A map is a directory containing two files, the data file,
+ containing all keys and values in the map, and a smaller index
+ file, containing a fraction of the keys. The fraction is determined by
+ {@link Writer#getIndexInterval()}.
+
+
The index file is read entirely into memory. Thus key implementations
+ should try to keep themselves small.
+
+
Map files are created by adding entries in-order. To maintain a large
+ database, perform updates by copying the previous version of a database and
+ merging in a sorted change list, to create a new version of the database in
+ a new file. Sorting large change lists can be done with {@link
+ SequenceFile.Sorter}.]]>
+
SequenceFile provides {@link Writer}, {@link Reader} and
+ {@link Sorter} classes for writing, reading and sorting respectively.
+
+ There are three SequenceFileWriters based on the
+ {@link CompressionType} used to compress key/value pairs:
+
+
+ Writer : Uncompressed records.
+
+
+ RecordCompressWriter : Record-compressed files, only compress
+ values.
+
+
+ BlockCompressWriter : Block-compressed files, both keys &
+ values are collected in 'blocks'
+ separately and compressed. The size of
+ the 'block' is configurable.
+
+
+
The actual compression algorithm used to compress key and/or values can be
+ specified by using the appropriate {@link CompressionCodec}.
+
+
The recommended way is to use the static createWriter methods
+ provided by the SequenceFile to chose the preferred format.
+
+
The {@link Reader} acts as the bridge and can read any of the above
+ SequenceFile formats.
+
+
SequenceFile Formats
+
+
Essentially there are 3 different formats for SequenceFiles
+ depending on the CompressionType specified. All of them share a
+ common header described below.
+
+
SequenceFile Header
+
+
+ version - 3 bytes of magic header SEQ, followed by 1 byte of actual
+ version number (e.g. SEQ4 or SEQ6)
+
+
+ keyClassName -key class
+
+
+ valueClassName - value class
+
+
+ compression - A boolean which specifies if compression is turned on for
+ keys/values in this file.
+
+
+ blockCompression - A boolean which specifies if block-compression is
+ turned on for keys/values in this file.
+
+
+ compression codec - CompressionCodec class which is used for
+ compression of keys and/or values (if compression is
+ enabled).
+
+
+ metadata - {@link Metadata} for this file.
+
+
+ sync - A sync marker to denote end of the header.
+
The compressed blocks of key lengths and value lengths consist of the
+ actual lengths of individual keys/values encoded in ZeroCompressedInteger
+ format.
+
+ @see CompressionCodec]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ key, skipping its
+ value. True if another entry exists, and false at end of file.]]>
+
+
+
+
+
+
+
+ key and
+ val. Returns true if such a pair exists and false when at
+ end of file]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The position passed must be a position returned by {@link
+ SequenceFile.Writer#getLength()} when writing this file. To seek to an arbitrary
+ position, use {@link SequenceFile.Reader#sync(long)}.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ SegmentDescriptor
+ @param segments the list of SegmentDescriptors
+ @param tmpDir the directory to write temporary files into
+ @return RawKeyValueIterator
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ For best performance, applications should make sure that the {@link
+ Writable#readFields(DataInput)} implementation of their keys is
+ very efficient. In particular, it should avoid allocating memory.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This always returns a synchronized position. In other words,
+ immediately after calling {@link SequenceFile.Reader#seek(long)} with a position
+ returned by this method, {@link SequenceFile.Reader#next(Writable)} may be called. However
+ the key may be earlier in the file than key last written when this
+ method was called (e.g., with block-compression, it may be the first key
+ in the block that was being written when this method was called).]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ key. Returns
+ true if such a key exists and false when at the end of the set.]]>
+
+
+
+
+
+
+ key.
+ Returns key, or null if no match exists.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the class of the objects to stringify]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ position. Note that this
+ method avoids using the converter or doing String instatiation
+ @return the Unicode scalar value at position or -1
+ if the position is invalid or points to a
+ trailing byte]]>
+
+
+
+
+
+
+
+
+
+ what in the backing
+ buffer, starting as position start. The starting
+ position is measured in bytes and the return value is in
+ terms of byte position in the buffer. The backing buffer is
+ not converted to a string for this operation.
+ @return byte position of the first occurence of the search
+ string in the UTF-8 buffer or -1 if not found]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a Text with the same contents.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ replace is true, then
+ malformed input is replaced with the
+ substitution character, which is U+FFFD. Otherwise the
+ method throws a MalformedInputException.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ replace is true, then
+ malformed input is replaced with the
+ substitution character, which is U+FFFD. Otherwise the
+ method throws a MalformedInputException.
+ @return ByteBuffer: bytes stores at ByteBuffer.array()
+ and length is ByteBuffer.limit()]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ In
+ addition, it provides methods for string traversal without converting the
+ byte array to a string.
Also includes utilities for
+ serializing/deserialing a string, coding/decoding a string, checking if a
+ byte array contains valid UTF8 code, calculating the length of an encoded
+ string.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a UTF8 with the same contents.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Also includes utilities for efficiently reading and writing UTF-8.
+
+ @deprecated replaced by Text]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This is useful when a class may evolve, so that instances written by the
+ old version of the class may still be processed by the new version. To
+ handle this situation, {@link #readFields(DataInput)}
+ implementations should catch {@link VersionMismatchException}.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a VIntWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ o is a VLongWritable with the same value.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ out.
+
+ @param out DataOuput to serialize this object into.
+ @throws IOException]]>
+
+
+
+
+
+
+ in.
+
+
For efficiency, implementations should attempt to re-use storage in the
+ existing object where possible.
+
+ @param in DataInput to deseriablize this object from.
+ @throws IOException]]>
+
+
+
+ Any key or value type in the Hadoop Map-Reduce
+ framework implements this interface.
+
+
Implementations typically implement a static read(DataInput)
+ method which constructs a new instance, calls {@link #readFields(DataInput)}
+ and returns the instance.
+
+
Example:
+
+ public class MyWritable implements Writable {
+ // Some data
+ private int counter;
+ private long timestamp;
+
+ public void write(DataOutput out) throws IOException {
+ out.writeInt(counter);
+ out.writeLong(timestamp);
+ }
+
+ public void readFields(DataInput in) throws IOException {
+ counter = in.readInt();
+ timestamp = in.readLong();
+ }
+
+ public static MyWritable read(DataInput in) throws IOException {
+ MyWritable w = new MyWritable();
+ w.readFields(in);
+ return w;
+ }
+ }
+
]]>
+
+
+
+
+
+
+
+
+ WritableComparables can be compared to each other, typically
+ via Comparators. Any type which is to be used as a
+ key in the Hadoop Map-Reduce framework should implement this
+ interface.
+
+
Example:
+
+ public class MyWritableComparable implements WritableComparable {
+ // Some data
+ private int counter;
+ private long timestamp;
+
+ public void write(DataOutput out) throws IOException {
+ out.writeInt(counter);
+ out.writeLong(timestamp);
+ }
+
+ public void readFields(DataInput in) throws IOException {
+ counter = in.readInt();
+ timestamp = in.readLong();
+ }
+
+ public int compareTo(MyWritableComparable w) {
+ int thisValue = this.value;
+ int thatValue = ((IntWritable)o).value;
+ return (thisValue < thatValue ? -1 : (thisValue==thatValue ? 0 : 1));
+ }
+ }
+
One may optimize compare-intensive operations by overriding
+ {@link #compare(byte[],int,int,byte[],int,int)}. Static utility methods are
+ provided to assist in optimized implementations of this method.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Enum type
+ @param in DataInput to read from
+ @param enumType Class type of Enum
+ @return Enum represented by String read from DataInput
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ len number of bytes in input streamin
+ @param in input stream
+ @param len number of bytes to skip
+ @throws IOException when skipped less number of bytes]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ CompressionCodec for which to get the
+ Compressor
+ @return Compressor for the given
+ CompressionCodec from the pool or a new one]]>
+
+
+
+
+
+ CompressionCodec for which to get the
+ Decompressor
+ @return Decompressor for the given
+ CompressionCodec the pool or a new one]]>
+
+
+
+
+
+ Compressor to be returned to the pool]]>
+
+
+
+
+
+ Decompressor to be returned to the
+ pool]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Implementations are assumed to be buffered. This permits clients to
+ reposition the underlying input stream then call {@link #resetState()},
+ without having to also synchronize client buffers.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true indicating that more input data is required.
+
+ @param b Input data
+ @param off Start offset
+ @param len Length]]>
+
+
+
+
+ true if the input data buffer is empty and
+ #setInput() should be called in order to provide more input.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the end of the compressed
+ data output stream has been reached.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true indicating that more input data is required.
+
+ @param b Input data
+ @param off Start offset
+ @param len Length]]>
+
+
+
+
+ true if the input data buffer is empty and
+ #setInput() should be called in order to provide more input.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ true if a preset dictionary is needed for decompression.
+ @return true if a preset dictionary is needed for decompression]]>
+
+
+
+
+ true if the end of the compressed
+ data output stream has been reached.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FIXME: This array should be in a private or package private location,
+ since it could be modified by malicious code.
+ ]]>
+
+
+
+
+ This interface is public for historical purposes. You should have no need to
+ use it.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Although BZip2 headers are marked with the magic "Bz" this
+ constructor expects the next byte in the stream to be the first one after
+ the magic. Thus callers have to skip the first two bytes. Otherwise this
+ constructor will throw an exception.
+
+
+ @throws IOException
+ if the stream content is malformed or an I/O error occurs.
+ @throws NullPointerException
+ if in == null]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The decompression requires large amounts of memory. Thus you should call the
+ {@link #close() close()} method as soon as possible, to force
+ CBZip2InputStream to release the allocated memory. See
+ {@link CBZip2OutputStream CBZip2OutputStream} for information about memory
+ usage.
+
+
+
+ CBZip2InputStream reads bytes from the compressed source stream via
+ the single byte {@link java.io.InputStream#read() read()} method exclusively.
+ Thus you should consider to use a buffered source stream.
+
+
+
+ Instances of this class are not threadsafe.
+
]]>
+
+
+
+
+
+
+
+
+
+ CBZip2OutputStream with a blocksize of 900k.
+
+
+ Attention: The caller is resonsible to write the two BZip2 magic
+ bytes "BZ" to the specified stream prior to calling this
+ constructor.
+
+
+ @param out *
+ the destination stream.
+
+ @throws IOException
+ if an I/O error occurs in the specified stream.
+ @throws NullPointerException
+ if out == null.]]>
+
+
+
+
+
+ CBZip2OutputStream with specified blocksize.
+
+
+ Attention: The caller is resonsible to write the two BZip2 magic
+ bytes "BZ" to the specified stream prior to calling this
+ constructor.
+
+
+
+ @param out
+ the destination stream.
+ @param blockSize
+ the blockSize as 100k units.
+
+ @throws IOException
+ if an I/O error occurs in the specified stream.
+ @throws IllegalArgumentException
+ if (blockSize < 1) || (blockSize > 9).
+ @throws NullPointerException
+ if out == null.
+
+ @see #MIN_BLOCKSIZE
+ @see #MAX_BLOCKSIZE]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ inputLength this method returns MAX_BLOCKSIZE
+ always.
+
+ @param inputLength
+ The length of the data which will be compressed by
+ CBZip2OutputStream.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ == 1.]]>
+
+
+
+
+ == 9.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ If you are ever unlucky/improbable enough to get a stack overflow whilst
+ sorting, increase the following constant and try again. In practice I
+ have never seen the stack go above 27 elems, so the following limit seems
+ very generous.
+ ]]>
+
+
+
+
+ The compression requires large amounts of memory. Thus you should call the
+ {@link #close() close()} method as soon as possible, to force
+ CBZip2OutputStream to release the allocated memory.
+
+
+
+ You can shrink the amount of allocated memory and maybe raise the compression
+ speed by choosing a lower blocksize, which in turn may cause a lower
+ compression ratio. You can avoid unnecessary memory allocation by avoiding
+ using a blocksize which is bigger than the size of the input.
+
+
+
+ You can compute the memory usage for compressing by the following formula:
+
+
+
+ <code>400k + (9 * blocksize)</code>.
+
+
+
+ To get the memory required for decompression by {@link CBZip2InputStream
+ CBZip2InputStream} use
+
+
+
+ <code>65k + (5 * blocksize)</code>.
+
+
+
+
+
+
+
Memory usage by blocksize
+
+
+
Blocksize
Compression
+ memory usage
Decompression
+ memory usage
+
+
+
100k
+
1300k
+
565k
+
+
+
200k
+
2200k
+
1065k
+
+
+
300k
+
3100k
+
1565k
+
+
+
400k
+
4000k
+
2065k
+
+
+
500k
+
4900k
+
2565k
+
+
+
600k
+
5800k
+
3065k
+
+
+
700k
+
6700k
+
3565k
+
+
+
800k
+
7600k
+
4065k
+
+
+
900k
+
8500k
+
4565k
+
+
+
+
+ For decompression CBZip2InputStream allocates less memory if the
+ bzipped input is smaller than one block.
+
Seek by key or by file offset.
+
+ The memory footprint of a TFile includes the following:
+
+
Some constant overhead of reading or writing a compressed block.
+
+
Each compressed block requires one compression/decompression codec for
+ I/O.
+
Temporary space to buffer the key.
+
Temporary space to buffer the value (for TFile.Writer only). Values are
+ chunk encoded, so that we buffer at most one chunk of user data. By default,
+ the chunk buffer is 1MB. Reading chunked value does not require additional
+ memory.
+
+
TFile index, which is proportional to the total number of Data Blocks.
+ The total amount of memory needed to hold the index can be estimated as
+ (56+AvgKeySize)*NumBlocks.
+
MetaBlock index, which is proportional to the total number of Meta
+ Blocks.The total amount of memory needed to hold the index for Meta Blocks
+ can be estimated as (40+AvgMetaBlockName)*NumMetaBlock.
+
+
+ The behavior of TFile can be customized by the following variables through
+ Configuration:
+
+
tfile.io.chunk.size: Value chunk size. Integer (in bytes). Default
+ to 1MB. Values of the length less than the chunk size is guaranteed to have
+ known value length in read time (See
+ {@link TFile.Reader.Scanner.Entry#isValueLengthKnown()}).
+
tfile.fs.output.buffer.size: Buffer size used for
+ FSDataOutputStream. Integer (in bytes). Default to 256KB.
+
tfile.fs.input.buffer.size: Buffer size used for
+ FSDataInputStream. Integer (in bytes). Default to 256KB.
+
+
+ Suggestions on performance optimization.
+
+
Minimum block size. We recommend a setting of minimum block size between
+ 256KB to 1MB for general usage. Larger block size is preferred if files are
+ primarily for sequential access. However, it would lead to inefficient random
+ access (because there are more data to decompress). Smaller blocks are good
+ for random access, but require more memory to hold the block index, and may
+ be slower to create (because we must flush the compressor stream at the
+ conclusion of each data block, which leads to an FS I/O flush). Further, due
+ to the internal caching in Compression codec, the smallest possible block
+ size would be around 20KB-30KB.
+
The current implementation does not offer true multi-threading for
+ reading. The implementation uses FSDataInputStream seek()+read(), which is
+ shown to be much faster than positioned-read call in single thread mode.
+ However, it also means that if multiple threads attempt to access the same
+ TFile (using multiple scanners) simultaneously, the actual I/O is carried out
+ sequentially even if they access different DFS blocks.
+
Compression codec. Use "none" if the data is not very compressable (by
+ compressable, I mean a compression ratio at least 2:1). Generally, use "lzo"
+ as the starting point for experimenting. "gz" overs slightly better
+ compression ratio over "lzo" but requires 4x CPU to compress and 2x CPU to
+ decompress, comparing to "lzo".
+
File system buffering, if the underlying FSDataInputStream and
+ FSDataOutputStream is already adequately buffered; or if applications
+ reads/writes keys and values in large buffers, we can reduce the sizes of
+ input/output buffering in TFile layer by setting the configuration parameters
+ "tfile.fs.input.buffer.size" and "tfile.fs.output.buffer.size".
+
+
+ Some design rationale behind TFile can be found at Hadoop-3315.]]>
+
+ Use {@link Scanner#advance()} to move the cursor to the next key-value
+ pair (or end if none exists). Use seekTo methods (
+ {@link Scanner#seekTo(byte[])} or
+ {@link Scanner#seekTo(byte[], int, int)}) to seek to any arbitrary
+ location in the covered range (including backward seeking). Use
+ {@link Scanner#rewind()} to seek back to the beginning of the scanner.
+ Use {@link Scanner#seekToEnd()} to seek to the end of the scanner.
+
+ Actual keys and values may be obtained through {@link Scanner.Entry}
+ object, which is obtained through {@link Scanner#entry()}.]]>
+
Algorithmic comparator: binary comparators that is language
+ independent. Currently, only "memcmp" is supported.
+
Language-specific comparator: binary comparators that can
+ only be constructed in specific language. For Java, the syntax
+ is "jclass:", followed by the class name of the RawComparator.
+ Currently, we only support RawComparators that can be
+ constructed through the default constructor (with no
+ parameters). Parameterized RawComparators such as
+ {@link WritableComparator} or
+ {@link JavaSerializationComparator} may not be directly used.
+ One should write a wrapper class that inherits from such classes
+ and use its default constructor to perform proper
+ initialization.
+
+ @param conf
+ The configuration object.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ If an exception is thrown, the TFile will be in an inconsistent
+ state. The only legitimate call after that would be close]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Utils#writeVLong(out, n).
+
+ @param out
+ output stream
+ @param n
+ The integer to be encoded
+ @throws IOException
+ @see Utils#writeVLong(DataOutput, long)]]>
+
+
+
+
+
+
+
+
+
if n in [-32, 127): encode in one byte with the actual value.
+ Otherwise,
+
if n in [-20*2^8, 20*2^8): encode in two bytes: byte[0] = n/256 - 52;
+ byte[1]=n&0xff. Otherwise,
+
if n IN [-16*2^16, 16*2^16): encode in three bytes: byte[0]=n/2^16 -
+ 88; byte[1]=(n>>8)&0xff; byte[2]=n&0xff. Otherwise,
+
if n in [-8*2^24, 8*2^24): encode in four bytes: byte[0]=n/2^24 - 112;
+ byte[1] = (n>>16)&0xff; byte[2] = (n>>8)&0xff; byte[3]=n&0xff. Otherwise:
+
if n in [-2^31, 2^31): encode in five bytes: byte[0]=-125; byte[1] =
+ (n>>24)&0xff; byte[2]=(n>>16)&0xff; byte[3]=(n>>8)&0xff; byte[4]=n&0xff;
+
if n in [-2^39, 2^39): encode in six bytes: byte[0]=-124; byte[1] =
+ (n>>32)&0xff; byte[2]=(n>>24)&0xff; byte[3]=(n>>16)&0xff;
+ byte[4]=(n>>8)&0xff; byte[5]=n&0xff
+
if n in [-2^47, 2^47): encode in seven bytes: byte[0]=-123; byte[1] =
+ (n>>40)&0xff; byte[2]=(n>>32)&0xff; byte[3]=(n>>24)&0xff;
+ byte[4]=(n>>16)&0xff; byte[5]=(n>>8)&0xff; byte[6]=n&0xff;
+
if n in [-2^55, 2^55): encode in eight bytes: byte[0]=-122; byte[1] =
+ (n>>48)&0xff; byte[2] = (n>>40)&0xff; byte[3]=(n>>32)&0xff;
+ byte[4]=(n>>24)&0xff; byte[5]=(n>>16)&0xff; byte[6]=(n>>8)&0xff;
+ byte[7]=n&0xff;
+
if n in [-2^63, 2^63): encode in nine bytes: byte[0]=-121; byte[1] =
+ (n>>54)&0xff; byte[2] = (n>>48)&0xff; byte[3] = (n>>40)&0xff;
+ byte[4]=(n>>32)&0xff; byte[5]=(n>>24)&0xff; byte[6]=(n>>16)&0xff;
+ byte[7]=(n>>8)&0xff; byte[8]=n&0xff;
+
+
+ @param out
+ output stream
+ @param n
+ the integer number
+ @throws IOException]]>
+
if (FB in [-72, -33]), return (FB+52)<<8 + NB[0]&0xff;
+
if (FB in [-104, -73]), return (FB+88)<<16 + (NB[0]&0xff)<<8 +
+ NB[1]&0xff;
+
if (FB in [-120, -105]), return (FB+112)<<24 + (NB[0]&0xff)<<16 +
+ (NB[1]&0xff)<<8 + NB[2]&0xff;
+
if (FB in [-128, -121]), return interpret NB[FB+129] as a signed
+ big-endian integer.
+
+ @param in
+ input stream
+ @return the decoded long integer.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Type of the input key.
+ @param list
+ The list
+ @param key
+ The input key.
+ @param cmp
+ Comparator for the key.
+ @return The index to the desired element if it exists; or list.size()
+ otherwise.]]>
+
+
+
+
+
+
+
+
+ Type of the input key.
+ @param list
+ The list
+ @param key
+ The input key.
+ @param cmp
+ Comparator for the key.
+ @return The index to the desired element if it exists; or list.size()
+ otherwise.]]>
+
+
+
+
+
+
+
+ Type of the input key.
+ @param list
+ The list
+ @param key
+ The input key.
+ @return The index to the desired element if it exists; or list.size()
+ otherwise.]]>
+
+
+
+
+
+
+
+ Type of the input key.
+ @param list
+ The list
+ @param key
+ The input key.
+ @return The index to the desired element if it exists; or list.size()
+ otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Keep trying a limited number of times, waiting a fixed time between attempts,
+ and then fail by re-throwing the exception.
+ ]]>
+
+
+
+
+
+
+
+
+ Keep trying for a maximum time, waiting a fixed time between attempts,
+ and then fail by re-throwing the exception.
+ ]]>
+
+
+
+
+
+
+
+
+ Keep trying a limited number of times, waiting a growing amount of time between attempts,
+ and then fail by re-throwing the exception.
+ The time between attempts is sleepTime mutliplied by the number of tries so far.
+ ]]>
+
+
+
+
+
+
+
+
+ Keep trying a limited number of times, waiting a growing amount of time between attempts,
+ and then fail by re-throwing the exception.
+ The time between attempts is sleepTime mutliplied by a random
+ number in the range of [0, 2 to the number of retries)
+ ]]>
+
+
+
+
+
+
+
+ Set a default policy with some explicit handlers for specific exceptions.
+ ]]>
+
+
+
+
+
+
+
+ A retry policy for RemoteException
+ Set a default policy with some explicit handlers for specific exceptions.
+ ]]>
+
+
+
+
+
+ Try once, and fail by re-throwing the exception.
+ This corresponds to having no retry mechanism in place.
+ ]]>
+
+
+
+
+
+ Try once, and fail silently for void methods, or by
+ re-throwing the exception for non-void methods.
+ ]]>
+
+
+
+
+
+ Keep trying forever.
+ ]]>
+
+
+
+
+ A collection of useful implementations of {@link RetryPolicy}.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+ Determines whether the framework should retry a
+ method for the given exception, and the number
+ of retries that have been made for that operation
+ so far.
+
+ @param e The exception that caused the method to fail.
+ @param retries The number of times the method has been retried.
+ @return true if the method should be retried,
+ false if the method should not be retried
+ but shouldn't fail with an exception (only for void methods).
+ @throws Exception The re-thrown exception e indicating
+ that the method failed and should not be retried further.]]>
+
+
+
+
+ Specifies a policy for retrying method failures.
+ Implementations of this interface should be immutable.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Create a proxy for an interface of an implementation class
+ using the same retry policy for each method in the interface.
+
+ @param iface the interface that the retry will implement
+ @param implementation the instance whose methods should be retried
+ @param retryPolicy the policy for retirying method call failures
+ @return the retry proxy]]>
+
+
+
+
+
+
+
+
+ Create a proxy for an interface of an implementation class
+ using the a set of retry policies specified by method name.
+ If no retry policy is defined for a method then a default of
+ {@link RetryPolicies#TRY_ONCE_THEN_FAIL} is used.
+
+ @param iface the interface that the retry will implement
+ @param implementation the instance whose methods should be retried
+ @param methodNameToPolicyMap a map of method names to retry policies
+ @return the retry proxy]]>
+
+
+
+
+ A factory for creating retry proxies.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+ Prepare the deserializer for reading.]]>
+
+
+
+
+
+
+
+ Deserialize the next object from the underlying input stream.
+ If the object t is non-null then this deserializer
+ may set its internal state to the next object read from the input
+ stream. Otherwise, if the object t is null a new
+ deserialized object will be created.
+
+ @return the deserialized object]]>
+
+
+
+
+
+ Close the underlying input stream and clear up any resources.]]>
+
+
+
+
+ Provides a facility for deserializing objects of type from an
+ {@link InputStream}.
+
+
+
+ Deserializers are stateful, but must not buffer the input since
+ other producers may read from the input between calls to
+ {@link #deserialize(Object)}.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A {@link RawComparator} that uses a {@link Deserializer} to deserialize
+ the objects to be compared so that the standard {@link Comparator} can
+ be used to compare them.
+
+
+ One may optimize compare-intensive operations by using a custom
+ implementation of {@link RawComparator} that operates directly
+ on byte representations.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ An experimental {@link Serialization} for Java {@link Serializable} classes.
+
+ @see JavaSerializationComparator]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A {@link RawComparator} that uses a {@link JavaSerialization}
+ {@link Deserializer} to deserialize objects that are then compared via
+ their {@link Comparable} interfaces.
+
+ @param
+ @see JavaSerialization]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Encapsulates a {@link Serializer}/{@link Deserializer} pair.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+ Serializations are found by reading the io.serializations
+ property from conf, which is a comma-delimited list of
+ classnames.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A factory for {@link Serialization}s.
+ ]]>
+
+
+
+
+
+
+
+
+
+ Prepare the serializer for writing.]]>
+
+
+
+
+
+
+ Serialize t to the underlying output stream.]]>
+
+
+
+
+
+ Close the underlying output stream and clear up any resources.]]>
+
+
+
+
+ Provides a facility for serializing objects of type to an
+ {@link OutputStream}.
+
+
+
+ Serializers are stateful, but must not buffer the output since
+ other producers may write to the output between calls to
+ {@link #serialize(Object)}.
+
+ @param ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ param, to the IPC server running at
+ address, returning the value. Throws exceptions if there are
+ network problems or if the remote code threw an exception.
+ @deprecated Use {@link #call(Writable, InetSocketAddress, Class, UserGroupInformation)} instead]]>
+
+
+
+
+
+
+
+
+
+ param, to the IPC server running at
+ address with the ticket credentials, returning
+ the value.
+ Throws exceptions if there are network problems or if the remote code
+ threw an exception.
+ @deprecated Use {@link #call(Writable, InetSocketAddress, Class, UserGroupInformation)} instead]]>
+
+
+
+
+
+
+
+
+
+
+ param, to the IPC server running at
+ address which is servicing the protocol protocol,
+ with the ticket credentials, returning the value.
+ Throws exceptions if there are network problems or if the remote code
+ threw an exception.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Unwraps any IOException.
+
+ @param lookupTypes the desired exception class.
+ @return IOException, which is either the lookupClass exception or this.]]>
+
+
+
+
+ This unwraps any Throwable that has a constructor taking
+ a String as a parameter.
+ Otherwise it returns this.
+
+ @return Throwable]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ protocol is a Java interface. All parameters and return types must
+ be one of:
+
+
a primitive type, boolean, byte,
+ char, short, int, long,
+ float, double, or void; or
+
+
a {@link String}; or
+
+
a {@link Writable}; or
+
+
an array of the above types
+
+ All methods in the protocol should throw only IOException. No field data of
+ the protocol instance is transmitted.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ handlerCount determines
+ the number of handler threads that will be used to process calls.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ,name=RpcActivityForPort"
+
+ Many of the activity metrics are sampled and averaged on an interval
+ which can be specified in the metrics config file.
+
+ For the metrics that are sampled and averaged, one must specify
+ a metrics context that does periodic update calls. Most metrics contexts do.
+ The default Null metrics context however does NOT. So if you aren't
+ using any other metrics context then you can turn on the viewing and averaging
+ of sampled metrics by specifying the following two lines
+ in the hadoop-meterics.properties file:
+
+ Note that the metrics are collected regardless of the context used.
+ The context with the update thread is used to average the data periodically
+
+
+
+ Impl details: We use a dynamic mbean that gets the list of the metrics
+ from the metrics registry passed as an argument to the constructor]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class has a number of metrics variables that are publicly accessible;
+ these variables (objects) have methods to update their values;
+ for example:
+
{@link #rpcQueueTime}.inc(time)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ For the statistics that are sampled and averaged, one must specify
+ a metrics context that does periodic update calls. Most do.
+ The default Null metrics context however does NOT. So if you aren't
+ using any other metrics context then you can turn on the viewing and averaging
+ of sampled metrics by specifying the following two lines
+ in the hadoop-meterics.properties file:
+
+ Note that the metrics are collected regardless of the context used.
+ The context with the update thread is used to average the data periodically]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ When constructing the instance, if the factory property
+ contextName.class exists,
+ its value is taken to be the name of the class to instantiate. Otherwise,
+ the default is to create an instance of
+ org.apache.hadoop.metrics.spi.NullContext, which is a
+ dummy "no-op" context which will cause all metric data to be discarded.
+
+ @param contextName the name of the context
+ @return the named MetricsContext]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ When the instance is constructed, this method checks if the file
+ hadoop-metrics.properties exists on the class path. If it
+ exists, it must be in the format defined by java.util.Properties, and all
+ the properties in the file are set as attributes on the newly created
+ ContextFactory instance.
+
+ @return the singleton ContextFactory instance]]>
+
+
+
+ getFactory() method.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ startMonitoring() again after calling
+ this.
+ @see #close()]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ recordName.
+ Throws an exception if the metrics implementation is configured with a fixed
+ set of record names and recordName is not in that set.
+
+ @param recordName the name of the record
+ @throws MetricsException if recordName conflicts with configuration data]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A record name identifies the kind of data to be reported. For example, a
+ program reporting statistics relating to the disks on a computer might use
+ a record name "diskStats".
+
+ A record has zero or more tags. A tag has a name and a value. To
+ continue the example, the "diskStats" record might use a tag named
+ "diskName" to identify a particular disk. Sometimes it is useful to have
+ more than one tag, so there might also be a "diskType" with value "ide" or
+ "scsi" or whatever.
+
+ A record also has zero or more metrics. These are the named
+ values that are to be reported to the metrics system. In the "diskStats"
+ example, possible metric names would be "diskPercentFull", "diskPercentBusy",
+ "kbReadPerSecond", etc.
+
+ The general procedure for using a MetricsRecord is to fill in its tag and
+ metric values, and then call update() to pass the record to the
+ client library.
+ Metric data is not immediately sent to the metrics system
+ each time that update() is called.
+ An internal table is maintained, identified by the record name. This
+ table has columns
+ corresponding to the tag and the metric names, and rows
+ corresponding to each unique set of tag values. An update
+ either modifies an existing row in the table, or adds a new row with a set of
+ tag values that are different from all the other rows. Note that if there
+ are no tags, then there can be at most one row in the table.
+
+ Once a row is added to the table, its data will be sent to the metrics system
+ on every timer period, whether or not it has been updated since the previous
+ timer period. If this is inappropriate, for example if metrics were being
+ reported by some transient object in an application, the remove()
+ method can be used to remove the row and thus stop the data from being
+ sent.
+
+ Note that the update() method is atomic. This means that it is
+ safe for different threads to be updating the same metric. More precisely,
+ it is OK for different threads to call update() on MetricsRecord instances
+ with the same set of tag names and tag values. Different threads should
+ not use the same MetricsRecord instance at the same time.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ MetricsContext.registerUpdater().]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ fileName attribute,
+ if specified. Otherwise the data will be written to standard
+ output.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class is configured by setting ContextFactory attributes which in turn
+ are usually configured through a properties file. All the attributes are
+ prefixed by the contextName. For example, the properties file might contain:
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ contextName.tableName. The returned map consists of
+ those attributes with the contextName and tableName stripped off.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ recordName.
+ Throws an exception if the metrics implementation is configured with a fixed
+ set of record names and recordName is not in that set.
+
+ @param recordName the name of the record
+ @throws MetricsException if recordName conflicts with configuration data]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class implements the internal table of metric data, and the timer
+ on which data is to be sent to the metrics system. Subclasses must
+ override the abstract emitRecord method in order to transmit
+ the data. ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ update
+ and remove().]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ hostname or hostname:port. If
+ the specs string is null, defaults to localhost:defaultPort.
+
+ @return a list of InetSocketAddress objects.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ,name="
+ Where the and are the supplied parameters
+
+ @param serviceName
+ @param nameName
+ @param theMbean - the MBean to register
+ @return the named used to register the MBean]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ hadoop.rpc.socket.factory.class.<ClassName>. When no
+ such parameter exists then fall back on the default socket factory as
+ configured by hadoop.rpc.socket.factory.class.default. If
+ this default socket factory is not configured, then fall back on the JVM
+ default socket factory.
+
+ @param conf the configuration
+ @param clazz the class (usually a {@link VersionedProtocol})
+ @return a socket factory]]>
+
+
+
+
+
+ hadoop.rpc.socket.factory.default
+
+ @param conf the configuration
+ @return the default socket factory as specified in the configuration or
+ the JVM default socket factory if the configuration does not
+ contain a default socket factory property.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ :
+ ://:/]]>
+
+
+
+
+
+
+
+ :
+ ://:/]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ From documentation for {@link #getInputStream(Socket, long)}:
+ Returns InputStream for the socket. If the socket has an associated
+ SocketChannel then it returns a
+ {@link SocketInputStream} with the given timeout. If the socket does not
+ have a channel, {@link Socket#getInputStream()} is returned. In the later
+ case, the timeout argument is ignored and the timeout set with
+ {@link Socket#setSoTimeout(int)} applies for reads.
+
+ Any socket created using socket factories returned by {@link #NetUtils},
+ must use this interface instead of {@link Socket#getInputStream()}.
+
+ @see #getInputStream(Socket, long)
+
+ @param socket
+ @return InputStream for reading from the socket.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+ Any socket created using socket factories returned by {@link #NetUtils},
+ must use this interface instead of {@link Socket#getInputStream()}.
+
+ @see Socket#getChannel()
+
+ @param socket
+ @param timeout timeout in milliseconds. This may not always apply. zero
+ for waiting as long as necessary.
+ @return InputStream for reading from the socket.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+ From documentation for {@link #getOutputStream(Socket, long)} :
+ Returns OutputStream for the socket. If the socket has an associated
+ SocketChannel then it returns a
+ {@link SocketOutputStream} with the given timeout. If the socket does not
+ have a channel, {@link Socket#getOutputStream()} is returned. In the later
+ case, the timeout argument is ignored and the write will wait until
+ data is available.
+
+ Any socket created using socket factories returned by {@link #NetUtils},
+ must use this interface instead of {@link Socket#getOutputStream()}.
+
+ @see #getOutputStream(Socket, long)
+
+ @param socket
+ @return OutputStream for writing to the socket.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+ Any socket created using socket factories returned by {@link #NetUtils},
+ must use this interface instead of {@link Socket#getOutputStream()}.
+
+ @see Socket#getChannel()
+
+ @param socket
+ @param timeout timeout in milliseconds. This may not always apply. zero
+ for waiting as long as necessary.
+ @return OutputStream for writing to the socket.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+ socket.connect(endpoint, timeout). If
+ socket.getChannel() returns a non-null channel,
+ connect is implemented using Hadoop's selectors. This is done mainly
+ to avoid Sun's connect implementation from creating thread-local
+ selectors, since Hadoop does not have control on when these are closed
+ and could end up taking all the available file descriptors.
+
+ @see java.net.Socket#connect(java.net.SocketAddress, int)
+
+ @param socket
+ @param endpoint
+ @param timeout - timeout in milliseconds]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ node
+
+ @param node
+ a node
+ @return true if node is already in the tree; false otherwise]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ scope
+ if scope starts with ~, choose one from the all nodes except for the
+ ones in scope; otherwise, choose one from scope
+ @param scope range of nodes from which a node will be choosen
+ @return the choosen node]]>
+
+
+
+
+
+
+ scope but not in excludedNodes
+ if scope starts with ~, return the number of nodes that are not
+ in scope and excludedNodes;
+ @param scope a path string that may start with ~
+ @param excludedNodes a list of nodes
+ @return number of available nodes]]>
+
+
+
+
+
+
+
+
+
+
+
+ reader
+ It linearly scans the array, if a local node is found, swap it with
+ the first element of the array.
+ If a local rack node is found, swap it with the first element following
+ the local node.
+ If neither local node or local rack node is found, put a random replica
+ location at postion 0.
+ It leaves the rest nodes untouched.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Create a new input stream with the given timeout. If the timeout
+ is zero, it will be treated as infinite timeout. The socket's
+ channel will be configured to be non-blocking.
+
+ @see SocketInputStream#SocketInputStream(ReadableByteChannel, long)
+
+ @param socket should have a channel associated with it.
+ @param timeout timeout timeout in milliseconds. must not be negative.
+ @throws IOException]]>
+
+
+
+
+
+
+
+ Create a new input stream with the given timeout. If the timeout
+ is zero, it will be treated as infinite timeout. The socket's
+ channel will be configured to be non-blocking.
+ @see SocketInputStream#SocketInputStream(ReadableByteChannel, long)
+
+ @param socket should have a channel associated with it.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Create a new ouput stream with the given timeout. If the timeout
+ is zero, it will be treated as infinite timeout. The socket's
+ channel will be configured to be non-blocking.
+
+ @see SocketOutputStream#SocketOutputStream(WritableByteChannel, long)
+
+ @param socket should have a channel associated with it.
+ @param timeout timeout timeout in milliseconds. must not be negative.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ = getCount().
+ @param newCapacity The new capacity in bytes.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Index idx = startVector(...);
+ while (!idx.done()) {
+ .... // read element of a vector
+ idx.incr();
+ }
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This task takes the given record definition files and compiles them into
+ java or c++
+ files. It is then up to the user to compile the generated files.
+
+
The task requires the file or the nested fileset element to be
+ specified. Optional attributes are language (set the output
+ language, default is "java"),
+ destdir (name of the destination directory for generated java/c++
+ code, default is ".") and failonerror (specifies error handling
+ behavior. default is true).
+
]]>
+
+
+
+
+
+
+
+
+
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ (cause==null ? null : cause.toString()) (which
+ typically contains the class and detail message of cause).
+ @param cause the cause (which is saved for later retrieval by the
+ {@link #getCause()} method). (A null value is
+ permitted, and indicates that the cause is nonexistent or
+ unknown.)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ Group with the given groupname.
+ @param group group name]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ugi.
+ @param ugi user
+ @return the {@link Subject} for the user identified by ugi]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ugi as a comma separated string in
+ conf as a property attr
+
+ The String starts with the user name followed by the default group names,
+ and other group names.
+
+ @param conf configuration
+ @param attr property name
+ @param ugi a UnixUserGroupInformation]]>
+
+
+
+
+
+
+
+ conf
+
+ The object is expected to store with the property name attr
+ as a comma separated string that starts
+ with the user name followed by group names.
+ If the property name is not defined, return null.
+ It's assumed that there is only one UGI per user. If this user already
+ has a UGI in the ugi map, return the ugi in the map.
+ Otherwise, construct a UGI from the configuration, store it in the
+ ugi map and return it.
+
+ @param conf configuration
+ @param attr property name
+ @return a UnixUGI
+ @throws LoginException if the stored string is ill-formatted.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ User with the given username.
+ @param user user name]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ (cause==null ? null : cause.toString()) (which
+ typically contains the class and detail message of cause).
+ @param cause the cause (which is saved for later retrieval by the
+ {@link #getCause()} method). (A null value is
+ permitted, and indicates that the cause is nonexistent or
+ unknown.)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ does not provide the stack trace for security purposes.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ service as related to
+ Service Level Authorization for Hadoop.
+
+ Each service defines it's configuration key and also the necessary
+ {@link Permission} required to access the service.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ in]]>
+
+
+
+
+
+
+ out.]]>
+
+
+
+
+
+
+
+
+
+ reset is true, then resets the checksum.
+ @return number of bytes written. Will be equal to getChecksumSize();]]>
+
+
+
+
+
+
+
+
+ reset is true, then resets the checksum.
+ @return number of bytes written. Will be equal to getChecksumSize();]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ GenericOptionsParser to parse only the generic Hadoop
+ arguments.
+
+ The array of string arguments other than the generic arguments can be
+ obtained by {@link #getRemainingArgs()}.
+
+ @param conf the Configuration to modify.
+ @param args command-line arguments.]]>
+
+
+
+
+ GenericOptionsParser to parse given options as well
+ as generic Hadoop options.
+
+ The resulting CommandLine object can be obtained by
+ {@link #getCommandLine()}.
+
+ @param conf the configuration to modify
+ @param options options built by the caller
+ @param args User-specified arguments]]>
+
+
+
+
+ Strings containing the un-parsed arguments
+ or empty array if commandLine was not defined.]]>
+
+
+
+
+
+
+
+
+
+ CommandLine object
+ to process the parsed arguments.
+
+ Note: If the object is created with
+ {@link #GenericOptionsParser(Configuration, String[])}, then returned
+ object will only contain parsed generic options.
+
+ @return CommandLine representing list of arguments
+ parsed against Options descriptor.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ GenericOptionsParser is a utility to parse command line
+ arguments generic to the Hadoop framework.
+
+ GenericOptionsParser recognizes several standarad command
+ line arguments, enabling applications to easily specify a namenode, a
+ jobtracker, additional configuration resources etc.
+
+
Generic Options
+
+
The supported generic options are:
+
+ -conf <configuration file> specify a configuration file
+ -D <property=value> use value for given property
+ -fs <local|namenode:port> specify a namenode
+ -jt <local|jobtracker:port> specify a job tracker
+ -files <comma separated list of files> specify comma separated
+ files to be copied to the map reduce cluster
+ -libjars <comma separated list of jars> specify comma separated
+ jar files to include in the classpath.
+ -archives <comma separated list of archives> specify comma
+ separated archives to be unarchived on the compute machines.
+
+
Generic command line arguments might modify
+ Configuration objects, given to constructors.
+
+
The functionality is implemented using Commons CLI.
+
+
Examples:
+
+ $ bin/hadoop dfs -fs darwin:8020 -ls /data
+ list /data directory in dfs with namenode darwin:8020
+
+ $ bin/hadoop dfs -D fs.default.name=darwin:8020 -ls /data
+ list /data directory in dfs with namenode darwin:8020
+
+ $ bin/hadoop dfs -conf hadoop-site.xml -ls /data
+ list /data directory in dfs with conf specified in hadoop-site.xml
+
+ $ bin/hadoop job -D mapred.job.tracker=darwin:50020 -submit job.xml
+ submit a job to job tracker darwin:50020
+
+ $ bin/hadoop job -jt darwin:50020 -submit job.xml
+ submit a job to job tracker darwin:50020
+
+ $ bin/hadoop job -jt local -submit job.xml
+ submit a job to local runner
+
+ $ bin/hadoop jar -libjars testlib.jar
+ -archives test.tgz -files file.txt inputjar args
+ job submission with libjars, files and archives
+
+
+ @see Tool
+ @see ToolRunner]]>
+
+
+
+
+
+
+
+
+
+
+ Class<T>) of the
+ argument of type T.
+ @param The type of the argument
+ @param t the object to get it class
+ @return Class<T>]]>
+
+
+
+
+
+
+ List<T> to a an array of
+ T[].
+ @param c the Class object of the items in the list
+ @param list the list to convert]]>
+
+
+
+
+
+ List<T> to a an array of
+ T[].
+ @param list the list to convert
+ @throws ArrayIndexOutOfBoundsException if the list is empty.
+ Use {@link #toArray(Class, List)} if the list may be empty.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ io.file.buffer.size specified in the given
+ Configuration.
+ @param in input stream
+ @param conf configuration
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if native-hadoop is loaded,
+ else false]]>
+
+
+
+
+
+ true if native hadoop libraries, if present, can be
+ used for this job; false otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ { pq.top().change(); pq.adjustTop(); }
+ instead of
+ { o = pq.pop(); o.change(); pq.push(o); }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Clients and/or applications can use the provided Progressable
+ to explicitly report progress to the Hadoop framework. This is especially
+ important for operations which take an insignificant amount of time since,
+ in-lieu of the reported progress, the framework has to assume that an error
+ has occured and time-out the operation.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Class is to be obtained
+ @return the correctly typed Class of the given object.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Hadoop Pipes
+ or Hadoop Streaming.
+
+ It also checks to ensure that we are running on a *nix platform else
+ (e.g. in Cygwin/Windows) it returns null.
+ @param conf configuration
+ @return a String[] with the ulimit command arguments or
+ null if we are running on a non *nix platform or
+ if the limit is unspecified.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Shell interface.
+ @param cmd shell command to execute.
+ @return the output of the executed command.]]>
+
+
+
+
+
+
+
+ Shell interface.
+ @param env the map of environment key=value
+ @param cmd shell command to execute.
+ @return the output of the executed command.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Shell can be used to run unix commands like du or
+ df. It also offers facilities to gate commands by
+ time-intervals.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ShellCommandExecutorshould be used in cases where the output
+ of the command needs no explicit parsing and where the command, working
+ directory and the environment remains unchanged. The output of the command
+ is stored as-is and is expected to be small.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ArrayList of string values]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ charToEscape in the string
+ with the escape char escapeChar
+
+ @param str string
+ @param escapeChar escape char
+ @param charToEscape the char to be escaped
+ @return an escaped string]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ charToEscape in the string
+ with the escape char escapeChar
+
+ @param str string
+ @param escapeChar escape char
+ @param charToEscape the escaped char
+ @return an unescaped string]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Tool, is the standard for any Map-Reduce tool/application.
+ The tool/application should delegate the handling of
+
+ standard command-line options to {@link ToolRunner#run(Tool, String[])}
+ and only handle its custom arguments.
+
+
Here is how a typical Tool is implemented:
+
+ public class MyApp extends Configured implements Tool {
+
+ public int run(String[] args) throws Exception {
+ // Configuration processed by ToolRunner
+ Configuration conf = getConf();
+
+ // Create a JobConf using the processed conf
+ JobConf job = new JobConf(conf, MyApp.class);
+
+ // Process custom command-line options
+ Path in = new Path(args[1]);
+ Path out = new Path(args[2]);
+
+ // Specify various job-specific parameters
+ job.setJobName("my-app");
+ job.setInputPath(in);
+ job.setOutputPath(out);
+ job.setMapperClass(MyApp.MyMapper.class);
+ job.setReducerClass(MyApp.MyReducer.class);
+
+ // Submit the job, then poll for progress until the job is complete
+ JobClient.runJob(job);
+ }
+
+ public static void main(String[] args) throws Exception {
+ // Let ToolRunner handle generic command-line options
+ int res = ToolRunner.run(new Configuration(), new Sort(), args);
+
+ System.exit(res);
+ }
+ }
+
+
+ @see GenericOptionsParser
+ @see ToolRunner]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Tool by {@link Tool#run(String[])}, after
+ parsing with the given generic arguments. Uses the given
+ Configuration, or builds one if null.
+
+ Sets the Tool's configuration with the possibly modified
+ version of the conf.
+
+ @param conf Configuration for the Tool.
+ @param tool Tool to run.
+ @param args command-line arguments to the tool.
+ @return exit code of the {@link Tool#run(String[])} method.]]>
+
+
+
+
+
+
+
+ Tool with its Configuration.
+
+ Equivalent to run(tool.getConf(), tool, args).
+
+ @param tool Tool to run.
+ @param args command-line arguments to the tool.
+ @return exit code of the {@link Tool#run(String[])} method.]]>
+
+
+
+
+
+
+
+
+
+ ToolRunner can be used to run classes implementing
+ Tool interface. It works in conjunction with
+ {@link GenericOptionsParser} to parse the
+
+ generic hadoop command line arguments and modifies the
+ Configuration of the Tool. The
+ application-specific options are passed along without being modified.
+
+
+ @see Tool
+ @see GenericOptionsParser]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ this filter.
+ @param nbHash The number of hash function to consider.
+ @param hashType type of the hashing function (see
+ {@link org.apache.hadoop.util.hash.Hash}).]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Bloom filter, as defined by Bloom in 1970.
+
+ The Bloom filter is a data structure that was introduced in 1970 and that has been adopted by
+ the networking research community in the past decade thanks to the bandwidth efficiencies that it
+ offers for the transmission of set membership information between networked hosts. A sender encodes
+ the information into a bit vector, the Bloom filter, that is more compact than a conventional
+ representation. Computation and space costs for construction are linear in the number of elements.
+ The receiver uses the filter to test whether various elements are members of the set. Though the
+ filter will occasionally return a false positive, it will never return a false negative. When creating
+ the filter, the sender can choose its desired point in a trade-off between the false positive rate and the size.
+
+
+
+
+
+
+
+
+
+
+
+
+
+ this filter.
+ @param nbHash The number of hash function to consider.
+ @param hashType type of the hashing function (see
+ {@link org.apache.hadoop.util.hash.Hash}).]]>
+
+
+
+
+
+
+
+
+ this counting Bloom filter.
+
+ Invariant: nothing happens if the specified key does not belong to this counter Bloom filter.
+ @param key The key to remove.]]>
+
+
+
+
+
+
+
+
+
+
+
+ key -> count map.
+
NOTE: due to the bucket size of this filter, inserting the same
+ key more than 15 times will cause an overflow at all filter positions
+ associated with this key, and it will significantly increase the error
+ rate for this and other keys. For this reason the filter can only be
+ used to store small count values 0 <= N << 15.
+ @param key key to be tested
+ @return 0 if the key is not present. Otherwise, a positive value v will
+ be returned such that v == count with probability equal to the
+ error rate of this filter, and v > count otherwise.
+ Additionally, if the filter experienced an underflow as a result of
+ {@link #delete(Key)} operation, the return value may be lower than the
+ count with the probability of the false negative rate of such
+ filter.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ counting Bloom filter, as defined by Fan et al. in a ToN
+ 2000 paper.
+
+ A counting Bloom filter is an improvement to standard a Bloom filter as it
+ allows dynamic additions and deletions of set membership information. This
+ is achieved through the use of a counting vector instead of a bit vector.
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Builds an empty Dynamic Bloom filter.
+ @param vectorSize The number of bits in the vector.
+ @param nbHash The number of hash function to consider.
+ @param hashType type of the hashing function (see
+ {@link org.apache.hadoop.util.hash.Hash}).
+ @param nr The threshold for the maximum number of keys to record in a
+ dynamic Bloom filter row.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ dynamic Bloom filter, as defined in the INFOCOM 2006 paper.
+
+ A dynamic Bloom filter (DBF) makes use of a s * m bit matrix but
+ each of the s rows is a standard Bloom filter. The creation
+ process of a DBF is iterative. At the start, the DBF is a 1 * m
+ bit matrix, i.e., it is composed of a single standard Bloom filter.
+ It assumes that nr elements are recorded in the
+ initial bit vector, where nr <= n (n is
+ the cardinality of the set A to record in the filter).
+
+ As the size of A grows during the execution of the application,
+ several keys must be inserted in the DBF. When inserting a key into the DBF,
+ one must first get an active Bloom filter in the matrix. A Bloom filter is
+ active when the number of recorded keys, nr, is
+ strictly less than the current cardinality of A, n.
+ If an active Bloom filter is found, the key is inserted and
+ nr is incremented by one. On the other hand, if there
+ is no active Bloom filter, a new one is created (i.e., a new row is added to
+ the matrix) according to the current size of A and the element
+ is added in this new Bloom filter and the nr value of
+ this new Bloom filter is set to one. A given key is said to belong to the
+ DBF if the k positions are set to one in one of the matrix rows.
+
+
+
+
+
+
+
+
+
+
+ this filter.
+ @param nbHash The number of hash functions to consider.
+ @param hashType type of the hashing function (see {@link Hash}).]]>
+
+
+
+
+
+ this filter.
+ @param key The key to add.]]>
+
+
+
+
+
+ this filter.
+ @param key The key to test.
+ @return boolean True if the specified key belongs to this filter.
+ False otherwise.]]>
+
+
+
+
+
+ this filter and a specified filter.
+
+ Invariant: The result is assigned to this filter.
+ @param filter The filter to AND with.]]>
+
+
+
+
+
+ this filter and a specified filter.
+
+ Invariant: The result is assigned to this filter.
+ @param filter The filter to OR with.]]>
+
+
+
+
+
+ this filter and a specified filter.
+
+ Invariant: The result is assigned to this filter.
+ @param filter The filter to XOR with.]]>
+
+
+
+
+ this filter.
+
+ The result is assigned to this filter.]]>
+
+
+
+
+
+ this filter.
+ @param keys The list of keys.]]>
+
+
+
+
+
+ this filter.
+ @param keys The collection of keys.]]>
+
+
+
+
+
+ this filter.
+ @param keys The array of keys.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ this filter.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ A filter is a data structure which aims at offering a lossy summary of a set A. The
+ key idea is to map entries of A (also called keys) into several positions
+ in a vector through the use of several hash functions.
+
+ Typically, a filter will be implemented as a Bloom filter (or a Bloom filter extension).
+
+ It must be extended in order to define the real behavior.
+
+ @see Key The general behavior of a key
+ @see HashFunction A hash function]]>
+
+
+
+
+
+
+
+
+ Builds a hash function that must obey to a given maximum number of returned values and a highest value.
+ @param maxValue The maximum highest returned value.
+ @param nbHash The number of resulting hashed values.
+ @param hashType type of the hashing function (see {@link Hash}).]]>
+
+
+
+
+ this hash function. A NOOP]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Builds a key with a default weight.
+ @param value The byte value of this key.]]>
+
+
+
+
+
+ Builds a key with a specified weight.
+ @param value The value of this key.
+ @param weight The weight associated to this key.]]>
+
+
+
+
+
+
+
+
+
+
+
+ this key.]]>
+
+
+
+
+ this key.]]>
+
+
+
+
+
+ this key with a specified value.
+ @param weight The increment.]]>
+
+
+
+
+ this key by one.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The idea is to randomly select a bit to reset.]]>
+
+
+
+
+
+ The idea is to select the bit to reset that will generate the minimum
+ number of false negative.]]>
+
+
+
+
+
+ The idea is to select the bit to reset that will remove the maximum number
+ of false positive.]]>
+
+
+
+
+
+ The idea is to select the bit to reset that will, at the same time, remove
+ the maximum number of false positve while minimizing the amount of false
+ negative generated.]]>
+
+
+
+
+ Originally created by
+ European Commission One-Lab Project 034819.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ this filter.
+ @param nbHash The number of hash function to consider.
+ @param hashType type of the hashing function (see
+ {@link org.apache.hadoop.util.hash.Hash}).]]>
+
+
+
+
+
+
+
+
+ this retouched Bloom filter.
+
+ Invariant: if the false positive is null, nothing happens.
+ @param key The false positive key to add.]]>
+
+
+
+
+
+ this retouched Bloom filter.
+ @param coll The collection of false positive.]]>
+
+
+
+
+
+ this retouched Bloom filter.
+ @param keys The list of false positive.]]>
+
+
+
+
+
+ this retouched Bloom filter.
+ @param keys The array of false positive.]]>
+
+
+
+
+
+
+ this retouched Bloom filter.
+ @param scheme The selective clearing scheme to apply.]]>
+
+
+
+
+
+
+
+
+
+
+
+ retouched Bloom filter, as defined in the CoNEXT 2006 paper.
+
+ It allows the removal of selected false positives at the cost of introducing
+ random false negatives, and with the benefit of eliminating some random false
+ positives at the same time.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ length, and
+ the provided seed value
+ @param bytes input bytes
+ @param length length of the valid bytes to consider
+ @param initval seed value
+ @return hash value]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The best hash table sizes are powers of 2. There is no need to do mod
+ a prime (mod is sooo slow!). If you need less than 32 bits, use a bitmask.
+ For example, if you need only 10 bits, do
+ h = (h & hashmask(10));
+ In which case, the hash table should have hashsize(10) elements.
+
+
If you are hashing n strings byte[][] k, do it like this:
+ for (int i = 0, h = 0; i < n; ++i) h = hash( k[i], h);
+
+
By Bob Jenkins, 2006. bob_jenkins@burtleburtle.net. You may use this
+ code any way you wish, private, educational, or commercial. It's free.
+
+
Use for hash table lookup, or anything where one collision in 2^^32 is
+ acceptable. Do NOT use for cryptographic purposes.]]>
+
+
+
+
+
+
+
+
+
+
+ lookup3.c, by Bob Jenkins, May 2006, Public Domain.
+
+ You can use this free for any purpose. It's in the public domain.
+ It has no warranty.
+
+
+ @see lookup3.c
+ @see Hash Functions (and how this
+ function compares to others such as CRC, MD?, etc
+ @see Has update on the
+ Dr. Dobbs Article]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The C version of MurmurHash 2.0 found at that site was ported
+ to Java by Andrzej Bialecki (ab at getopt org).]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ JobTracker,
+ as {@link JobTracker.State}
+
+ @return the current state of the JobTracker.]]>
+
+
+
+
+ JobTracker
+
+ @return the size of heap memory used by the JobTracker]]>
+
+
+
+
+ JobTracker
+
+ @return the configured size of max heap memory that can be used by the JobTracker]]>
+
+
+
+
+
+
+
+
+
+
+
+ ClusterStatus provides clients with information such as:
+
+
+ Size of the cluster.
+
+
+ Name of the trackers.
+
+
+ Task capacity of the cluster.
+
+
+ The number of currently running map & reduce tasks.
+
+
+ State of the JobTracker.
+
+
+
+
Clients can query for the latest ClusterStatus, via
+ {@link JobClient#getClusterStatus()}.
Counters are bunched into {@link Group}s, each comprising of
+ counters from a particular Enum class.
+ @deprecated Use {@link org.apache.hadoop.mapreduce.Counters} instead.]]>
+
Grouphandles localization of the class name and the
+ counter names.
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FileInputFormat implementations can override this and return
+ false to ensure that individual input files are never split-up
+ so that {@link Mapper}s process entire files.
+
+ @param fs the file system that the file is on
+ @param filename the file name to check
+ @return is this file splitable?]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FileInputFormat is the base class for all file-based
+ InputFormats. This provides a generic implementation of
+ {@link #getSplits(JobConf, int)}.
+ Subclasses of FileInputFormat can also override the
+ {@link #isSplitable(FileSystem, Path)} method to ensure input-files are
+ not split-up and are processed as a whole by {@link Mapper}s.
+ @deprecated Use {@link org.apache.hadoop.mapreduce.lib.input.FileInputFormat}
+ instead.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the job output should be compressed,
+ false otherwise]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Tasks' Side-Effect Files
+
+
Note: The following is valid only if the {@link OutputCommitter}
+ is {@link FileOutputCommitter}. If OutputCommitter is not
+ a FileOutputCommitter, the task's temporary output
+ directory is same as {@link #getOutputPath(JobConf)} i.e.
+ ${mapred.output.dir}$
+
+
Some applications need to create/write-to side-files, which differ from
+ the actual job-outputs.
+
+
In such cases there could be issues with 2 instances of the same TIP
+ (running simultaneously e.g. speculative tasks) trying to open/write-to the
+ same file (path) on HDFS. Hence the application-writer will have to pick
+ unique names per task-attempt (e.g. using the attemptid, say
+ attempt_200709221812_0001_m_000000_0), not just per TIP.
+
+
To get around this the Map-Reduce framework helps the application-writer
+ out by maintaining a special
+ ${mapred.output.dir}/_temporary/_${taskid}
+ sub-directory for each task-attempt on HDFS where the output of the
+ task-attempt goes. On successful completion of the task-attempt the files
+ in the ${mapred.output.dir}/_temporary/_${taskid} (only)
+ are promoted to ${mapred.output.dir}. Of course, the
+ framework discards the sub-directory of unsuccessful task-attempts. This
+ is completely transparent to the application.
+
+
The application-writer can take advantage of this by creating any
+ side-files required in ${mapred.work.output.dir} during execution
+ of his reduce-task i.e. via {@link #getWorkOutputPath(JobConf)}, and the
+ framework will move them out similarly - thus she doesn't have to pick
+ unique paths per task-attempt.
+
+
Note: the value of ${mapred.work.output.dir} during
+ execution of a particular task-attempt is actually
+ ${mapred.output.dir}/_temporary/_{$taskid}, and this value is
+ set by the map-reduce framework. So, just create any side-files in the
+ path returned by {@link #getWorkOutputPath(JobConf)} from map/reduce
+ task to take advantage of this feature.
+
+
The entire discussion holds true for maps of jobs with
+ reducer=NONE (i.e. 0 reduces) since output of the map, in that case,
+ goes directly to HDFS.
+
+ @return the {@link Path} to the task's temporary output directory
+ for the map-reduce job.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The generated name can be used to create custom files from within the
+ different tasks for the job, the names for different tasks will not collide
+ with each other.
+
+
The given name is postfixed with the task type, 'm' for maps, 'r' for
+ reduces and the task partition number. For example, give a name 'test'
+ running on the first map o the job the generated name will be
+ 'test-m-00000'.
+
+ @param conf the configuration for the job.
+ @param name the name to make unique.
+ @return a unique name accross all tasks of the job.]]>
+
+
+
+
+
+
+ The path can be used to create custom files from within the map and
+ reduce tasks. The path name will be unique for each task. The path parent
+ will be the job output directory.ls
+
+
This method uses the {@link #getUniqueName} method to make the file name
+ unique for the task.
+
+ @param conf the configuration for the job.
+ @param name the name for the file.
+ @return a unique path accross all tasks of the job.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Each {@link InputSplit} is then assigned to an individual {@link Mapper}
+ for processing.
+
+
Note: The split is a logical split of the inputs and the
+ input files are not physically split into chunks. For e.g. a split could
+ be <input-file-path, start, offset> tuple.
+
+ @param job job configuration.
+ @param numSplits the desired number of splits, a hint.
+ @return an array of {@link InputSplit}s for the job.]]>
+
+
+
+
+
+
+
+
+ It is the responsibility of the RecordReader to respect
+ record boundaries while processing the logical split to present a
+ record-oriented view to the individual task.
+
+ @param split the {@link InputSplit}
+ @param job the job that this split belongs to
+ @return a {@link RecordReader}]]>
+
+
+
+ InputFormat describes the input-specification for a
+ Map-Reduce job.
+
+
The Map-Reduce framework relies on the InputFormat of the
+ job to:
+
+
+ Validate the input-specification of the job.
+
+ Split-up the input file(s) into logical {@link InputSplit}s, each of
+ which is then assigned to an individual {@link Mapper}.
+
+
+ Provide the {@link RecordReader} implementation to be used to glean
+ input records from the logical InputSplit for processing by
+ the {@link Mapper}.
+
+
+
+
The default behavior of file-based {@link InputFormat}s, typically
+ sub-classes of {@link FileInputFormat}, is to split the
+ input into logical {@link InputSplit}s based on the total size, in
+ bytes, of the input files. However, the {@link FileSystem} blocksize of
+ the input files is treated as an upper bound for input splits. A lower bound
+ on the split size can be set via
+
+ mapred.min.split.size.
+
+
Clearly, logical splits based on input-size is insufficient for many
+ applications since record boundaries are to respected. In such cases, the
+ application has to also implement a {@link RecordReader} on whom lies the
+ responsibilty to respect record-boundaries and present a record-oriented
+ view of the logical InputSplit to the individual task.
+
+ @see InputSplit
+ @see RecordReader
+ @see JobClient
+ @see FileInputFormat
+ @deprecated Use {@link org.apache.hadoop.mapreduce.InputFormat} instead.]]>
+
+
+
+
+
+
+
+
+
+ InputSplit.
+
+ @return the number of bytes in the input split.
+ @throws IOException]]>
+
+
+
+
+
+ InputSplit is
+ located as an array of Strings.
+ @throws IOException]]>
+
+
+
+ InputSplit represents the data to be processed by an
+ individual {@link Mapper}.
+
+
Typically, it presents a byte-oriented view on the input and is the
+ responsibility of {@link RecordReader} of the job to process this and present
+ a record-oriented view.
+
+ @see InputFormat
+ @see RecordReader
+ @deprecated Use {@link org.apache.hadoop.mapreduce.InputSplit} instead.]]>
+
+ Checking the input and output specifications of the job.
+
+
+ Computing the {@link InputSplit}s for the job.
+
+
+ Setup the requisite accounting information for the {@link DistributedCache}
+ of the job, if necessary.
+
+
+ Copying the job's jar and configuration to the map-reduce system directory
+ on the distributed file-system.
+
+
+ Submitting the job to the JobTracker and optionally monitoring
+ it's status.
+
+
+
+ Normally the user creates the application, describes various facets of the
+ job via {@link JobConf} and then uses the JobClient to submit
+ the job and monitor its progress.
+
+
Here is an example on how to use JobClient:
+
+ // Create a new JobConf
+ JobConf job = new JobConf(new Configuration(), MyJob.class);
+
+ // Specify various job-specific parameters
+ job.setJobName("myjob");
+
+ job.setInputPath(new Path("in"));
+ job.setOutputPath(new Path("out"));
+
+ job.setMapperClass(MyJob.MyMapper.class);
+ job.setReducerClass(MyJob.MyReducer.class);
+
+ // Submit the job, then poll for progress until the job is complete
+ JobClient.runJob(job);
+
+
+
Job Control
+
+
At times clients would chain map-reduce jobs to accomplish complex tasks
+ which cannot be done via a single map-reduce job. This is fairly easy since
+ the output of the job, typically, goes to distributed file-system and that
+ can be used as the input for the next job.
+
+
However, this also means that the onus on ensuring jobs are complete
+ (success/failure) lies squarely on the clients. In such situations the
+ various job-control options are:
+
+
+ {@link #runJob(JobConf)} : submits the job and returns only after
+ the job has completed.
+
+
+ {@link #submitJob(JobConf)} : only submits the job, then poll the
+ returned handle to the {@link RunningJob} to query status and make
+ scheduling decisions.
+
+
+ {@link JobConf#setJobEndNotificationURI(String)} : setup a notification
+ on job-completion, thus avoiding polling.
+
+
+
+ @see JobConf
+ @see ClusterStatus
+ @see Tool
+ @see DistributedCache]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ If the parameter {@code loadDefaults} is false, the new instance
+ will not load resources from the default files.
+
+ @param loadDefaults specifies whether to load from the default files]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if framework should keep the intermediate files
+ for failed tasks, false otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the outputs of the maps are to be compressed,
+ false otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This comparator should be provided if the equivalence rules for keys
+ for sorting the intermediates are different from those for grouping keys
+ before each call to
+ {@link Reducer#reduce(Object, java.util.Iterator, OutputCollector, Reporter)}.
+
+
For key-value pairs (K1,V1) and (K2,V2), the values (V1, V2) are passed
+ in a single call to the reduce function if K1 and K2 compare as equal.
+
+
Since {@link #setOutputKeyComparatorClass(Class)} can be used to control
+ how keys are sorted, this can be used in conjunction to simulate
+ secondary sort on values.
+
+
Note: This is not a guarantee of the reduce sort being
+ stable in any sense. (In any case, with the order of available
+ map-outputs to the reduce being non-deterministic, it wouldn't make
+ that much sense.)
+
+ @param theClass the comparator class to be used for grouping keys.
+ It should implement RawComparator.
+ @see #setOutputKeyComparatorClass(Class)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ combiner class used to combine map-outputs
+ before being sent to the reducers. Typically the combiner is same as the
+ the {@link Reducer} for the job i.e. {@link #getReducerClass()}.
+
+ @return the user-defined combiner class used to combine map-outputs.]]>
+
+
+
+
+
+ combiner class used to combine map-outputs
+ before being sent to the reducers.
+
+
The combiner is an application-specified aggregation operation, which
+ can help cut down the amount of data transferred between the
+ {@link Mapper} and the {@link Reducer}, leading to better performance.
+
+
The framework may invoke the combiner 0, 1, or multiple times, in both
+ the mapper and reducer tasks. In general, the combiner is called as the
+ sort/merge result is written to disk. The combiner must:
+
+
be side-effect free
+
have the same input and output key types and the same input and
+ output value types
+
+
+
Typically the combiner is same as the Reducer for the
+ job i.e. {@link #setReducerClass(Class)}.
+
+ @param theClass the user-defined combiner class used to combine
+ map-outputs.]]>
+
+
+
+
+ true.
+
+ @return true if speculative execution be used for this job,
+ false otherwise.]]>
+
+
+
+
+
+ true if speculative execution
+ should be turned on, else false.]]>
+
+
+
+
+ true.
+
+ @return true if speculative execution be
+ used for this job for map tasks,
+ false otherwise.]]>
+
+
+
+
+
+ true if speculative execution
+ should be turned on for map tasks,
+ else false.]]>
+
+
+
+
+ true.
+
+ @return true if speculative execution be used
+ for reduce tasks for this job,
+ false otherwise.]]>
+
+
+
+
+
+ true if speculative execution
+ should be turned on for reduce tasks,
+ else false.]]>
+
+
+
+
+ 1.
+
+ @return the number of reduce tasks for this job.]]>
+
+
+
+
+
+ Note: This is only a hint to the framework. The actual
+ number of spawned map tasks depends on the number of {@link InputSplit}s
+ generated by the job's {@link InputFormat#getSplits(JobConf, int)}.
+
+ A custom {@link InputFormat} is typically used to accurately control
+ the number of map tasks for the job.
+
+
How many maps?
+
+
The number of maps is usually driven by the total size of the inputs
+ i.e. total number of blocks of the input files.
+
+
The right level of parallelism for maps seems to be around 10-100 maps
+ per-node, although it has been set up to 300 or so for very cpu-light map
+ tasks. Task setup takes awhile, so it is best if the maps take at least a
+ minute to execute.
+
+
The default behavior of file-based {@link InputFormat}s is to split the
+ input into logical {@link InputSplit}s based on the total size, in
+ bytes, of input files. However, the {@link FileSystem} blocksize of the
+ input files is treated as an upper bound for input splits. A lower bound
+ on the split size can be set via
+
+ mapred.min.split.size.
+
+
Thus, if you expect 10TB of input data and have a blocksize of 128MB,
+ you'll end up with 82,000 maps, unless {@link #setNumMapTasks(int)} is
+ used to set it even higher.
+
+ @param n the number of map tasks for this job.
+ @see InputFormat#getSplits(JobConf, int)
+ @see FileInputFormat
+ @see FileSystem#getDefaultBlockSize()
+ @see FileStatus#getBlockSize()]]>
+
+
+
+
+ 1.
+
+ @return the number of reduce tasks for this job.]]>
+
+
+
+
+
+ How many reduces?
+
+
With 0.95 all of the reduces can launch immediately and
+ start transfering map outputs as the maps finish. With 1.75
+ the faster nodes will finish their first round of reduces and launch a
+ second wave of reduces doing a much better job of load balancing.
+
+
Increasing the number of reduces increases the framework overhead, but
+ increases load balancing and lowers the cost of failures.
+
+
The scaling factors above are slightly less than whole numbers to
+ reserve a few reduce slots in the framework for speculative-tasks, failures
+ etc.
+
+
Reducer NONE
+
+
It is legal to set the number of reduce-tasks to zero.
+
+
In this case the output of the map-tasks directly go to distributed
+ file-system, to the path set by
+ {@link FileOutputFormat#setOutputPath(JobConf, Path)}. Also, the
+ framework doesn't sort the map-outputs before writing it out to HDFS.
+
+ @param n the number of reduce tasks for this job.]]>
+
+
+
+
+ mapred.map.max.attempts
+ property. If this property is not already set, the default is 4 attempts.
+
+ @return the max number of attempts per map task.]]>
+
+
+
+
+
+
+
+
+
+
+ mapred.reduce.max.attempts
+ property. If this property is not already set, the default is 4 attempts.
+
+ @return the max number of attempts per reduce task.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ noFailures, the
+ tasktracker is blacklisted for this job.
+
+ @param noFailures maximum no. of failures of a given job per tasktracker.]]>
+
+
+
+
+ blacklisted for this job.
+
+ @return the maximum no. of failures of a given job per tasktracker.]]>
+
+
+
+
+ failed.
+
+ Defaults to zero, i.e. any failed map-task results in
+ the job being declared as {@link JobStatus#FAILED}.
+
+ @return the maximum percentage of map tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+
+ failed.
+
+ @param percent the maximum percentage of map tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+ failed.
+
+ Defaults to zero, i.e. any failed reduce-task results
+ in the job being declared as {@link JobStatus#FAILED}.
+
+ @return the maximum percentage of reduce tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+
+ failed.
+
+ @param percent the maximum percentage of reduce tasks that can fail without
+ the job being aborted.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The debug script can aid debugging of failed map tasks. The script is
+ given task's stdout, stderr, syslog, jobconf files as arguments.
+
+
The debug command, run on the node where the map failed, is:
+
+ $script $stdout $stderr $syslog $jobconf.
+
+
+
The script file is distributed through {@link DistributedCache}
+ APIs. The script needs to be symlinked.
+
+ @param mDbgScript the script name]]>
+
+
+
+
+
+
+
+
+
+
+ The debug script can aid debugging of failed reduce tasks. The script
+ is given task's stdout, stderr, syslog, jobconf files as arguments.
+
+
The debug command, run on the node where the map failed, is:
+
+ $script $stdout $stderr $syslog $jobconf.
+
+
+
The script file is distributed through {@link DistributedCache}
+ APIs. The script file needs to be symlinked
+
+ @param rDbgScript the script name]]>
+
+
+
+
+
+
+
+
+
+ null if it hasn't
+ been set.
+ @see #setJobEndNotificationURI(String)]]>
+
+
+
+
+
+ The uri can contain 2 special parameters: $jobId and
+ $jobStatus. Those, if present, are replaced by the job's
+ identifier and completion-status respectively.
+
+
This is typically used by application-writers to implement chaining of
+ Map-Reduce jobs in an asynchronous manner.
+
+ @param uri the job end notification uri
+ @see JobStatus
+ @see Job Completion and Chaining]]>
+
+
+
+
+
+ When a job starts, a shared directory is created at location
+
+ ${mapred.local.dir}/taskTracker/jobcache/$jobid/work/ .
+ This directory is exposed to the users through
+ job.local.dir .
+ So, the tasks can use this space
+ as scratch space and share files among them.
+ This value is available as System property also.
+
+ @return The localized job specific shared directory]]>
+
+
+
+
+
+ For backward compatibility, if the job configuration sets the
+ key {@link #MAPRED_TASK_MAXVMEM_PROPERTY} to a value different
+ from {@link #DISABLED_MEMORY_LIMIT}, that value will be used
+ after converting it from bytes to MB.
+ @return memory required to run a map task of the job, in MB,
+ or {@link #DISABLED_MEMORY_LIMIT} if unset.]]>
+
+
+
+
+
+
+
+
+ For backward compatibility, if the job configuration sets the
+ key {@link #MAPRED_TASK_MAXVMEM_PROPERTY} to a value different
+ from {@link #DISABLED_MEMORY_LIMIT}, that value will be used
+ after converting it from bytes to MB.
+ @return memory required to run a reduce task of the job, in MB,
+ or {@link #DISABLED_MEMORY_LIMIT} if unset.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This method is deprecated. Now, different memory limits can be
+ set for map and reduce tasks of a job, in MB.
+
+ For backward compatibility, if the job configuration sets the
+ key {@link #MAPRED_TASK_MAXVMEM_PROPERTY} to a value different
+ from {@link #DISABLED_MEMORY_LIMIT}, that value is returned.
+ Otherwise, this method will return the larger of the values returned by
+ {@link #getMemoryForMapTask()} and {@link #getMemoryForReduceTask()}
+ after converting them into bytes.
+
+ @return Memory required to run a task of this job, in bytes,
+ or {@link #DISABLED_MEMORY_LIMIT}, if unset.
+ @see #setMaxVirtualMemoryForTask(long)
+ @deprecated Use {@link #getMemoryForMapTask()} and
+ {@link #getMemoryForReduceTask()}]]>
+
+
+
+
+
+
+ mapred.task.maxvmem is split into
+ mapred.job.map.memory.mb
+ and mapred.job.map.memory.mb,mapred
+ each of the new key are set
+ as mapred.task.maxvmem / 1024
+ as new values are in MB
+
+ @param vmem Maximum amount of virtual memory in bytes any task of this job
+ can use.
+ @see #getMaxVirtualMemoryForTask()
+ @deprecated
+ Use {@link #setMemoryForMapTask(long mem)} and
+ Use {@link #setMemoryForReduceTask(long mem)}]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ JobConf is the primary interface for a user to describe a
+ map-reduce job to the Hadoop framework for execution. The framework tries to
+ faithfully execute the job as-is described by JobConf, however:
+
+
+ Some configuration parameters might have been marked as
+
+ final by administrators and hence cannot be altered.
+
+
+ While some job parameters are straight-forward to set
+ (e.g. {@link #setNumReduceTasks(int)}), some parameters interact subtly
+ rest of the framework and/or job-configuration and is relatively more
+ complex for the user to control finely (e.g. {@link #setNumMapTasks(int)}).
+
+
+
+
JobConf typically specifies the {@link Mapper}, combiner
+ (if any), {@link Partitioner}, {@link Reducer}, {@link InputFormat} and
+ {@link OutputFormat} implementations to be used etc.
+
+
Optionally JobConf is used to specify other advanced facets
+ of the job such as Comparators to be used, files to be put in
+ the {@link DistributedCache}, whether or not intermediate and/or job outputs
+ are to be compressed (and how), debugability via user-provided scripts
+ ( {@link #setMapDebugScript(String)}/{@link #setReduceDebugScript(String)}),
+ for doing post-processing on task logs, task's stdout, stderr, syslog.
+ and etc.
+
+
Here is an example on how to configure a job via JobConf:
+
+ // Create a new JobConf
+ JobConf job = new JobConf(new Configuration(), MyJob.class);
+
+ // Specify various job-specific parameters
+ job.setJobName("myjob");
+
+ FileInputFormat.setInputPaths(job, new Path("in"));
+ FileOutputFormat.setOutputPath(job, new Path("out"));
+
+ job.setMapperClass(MyJob.MyMapper.class);
+ job.setCombinerClass(MyJob.MyReducer.class);
+ job.setReducerClass(MyJob.MyReducer.class);
+
+ job.setInputFormat(SequenceFileInputFormat.class);
+ job.setOutputFormat(SequenceFileOutputFormat.class);
+
+ @param jtIdentifier jobTracker identifier, or null
+ @param jobId job number, or null
+ @return a regex pattern matching JobIDs]]>
+
+
+
+
+ An example JobID is :
+ job_200707121733_0003 , which represents the third job
+ running at the jobtracker started at 200707121733.
+
+ Applications should never construct or parse JobID strings, but rather
+ use appropriate constructors or {@link #forName(String)} method.
+
+ @see TaskID
+ @see TaskAttemptID]]>
+
Applications can use the {@link Reporter} provided to report progress
+ or just indicate that they are alive. In scenarios where the application
+ takes an insignificant amount of time to process individual key/value
+ pairs, this is crucial since the framework might assume that the task has
+ timed-out and kill that task. The other way of avoiding this is to set
+
+ mapred.task.timeout to a high-enough value (or even zero for no
+ time-outs).
+
+ @param key the input key.
+ @param value the input value.
+ @param output collects mapped keys and values.
+ @param reporter facility to report progress.]]>
+
+
+
+ Maps are the individual tasks which transform input records into a
+ intermediate records. The transformed intermediate records need not be of
+ the same type as the input records. A given input pair may map to zero or
+ many output pairs.
+
+
The Hadoop Map-Reduce framework spawns one map task for each
+ {@link InputSplit} generated by the {@link InputFormat} for the job.
+ Mapper implementations can access the {@link JobConf} for the
+ job via the {@link JobConfigurable#configure(JobConf)} and initialize
+ themselves. Similarly they can use the {@link Closeable#close()} method for
+ de-initialization.
+
+
The framework then calls
+ {@link #map(Object, Object, OutputCollector, Reporter)}
+ for each key/value pair in the InputSplit for that task.
+
+
All intermediate values associated with a given output key are
+ subsequently grouped by the framework, and passed to a {@link Reducer} to
+ determine the final output. Users can control the grouping by specifying
+ a Comparator via
+ {@link JobConf#setOutputKeyComparatorClass(Class)}.
+
+
The grouped Mapper outputs are partitioned per
+ Reducer. Users can control which keys (and hence records) go to
+ which Reducer by implementing a custom {@link Partitioner}.
+
+
Users can optionally specify a combiner, via
+ {@link JobConf#setCombinerClass(Class)}, to perform local aggregation of the
+ intermediate outputs, which helps to cut down the amount of data transferred
+ from the Mapper to the Reducer.
+
+
The intermediate, grouped outputs are always stored in
+ {@link SequenceFile}s. Applications can specify if and how the intermediate
+ outputs are to be compressed and which {@link CompressionCodec}s are to be
+ used via the JobConf.
+
+
If the job has
+ zero
+ reduces then the output of the Mapper is directly written
+ to the {@link FileSystem} without grouping by keys.
+
+
Example:
+
+ public class MyMapper<K extends WritableComparable, V extends Writable>
+ extends MapReduceBase implements Mapper<K, V, K, V> {
+
+ static enum MyCounters { NUM_RECORDS }
+
+ private String mapTaskId;
+ private String inputFile;
+ private int noRecords = 0;
+
+ public void configure(JobConf job) {
+ mapTaskId = job.get("mapred.task.id");
+ inputFile = job.get("map.input.file");
+ }
+
+ public void map(K key, V val,
+ OutputCollector<K, V> output, Reporter reporter)
+ throws IOException {
+ // Process the <key, value> pair (assume this takes a while)
+ // ...
+ // ...
+
+ // Let the framework know that we are alive, and kicking!
+ // reporter.progress();
+
+ // Process some more
+ // ...
+ // ...
+
+ // Increment the no. of <key, value> pairs processed
+ ++noRecords;
+
+ // Increment counters
+ reporter.incrCounter(NUM_RECORDS, 1);
+
+ // Every 100 records update application-level status
+ if ((noRecords%100) == 0) {
+ reporter.setStatus(mapTaskId + " processed " + noRecords +
+ " from input-file: " + inputFile);
+ }
+
+ // Output the result
+ output.collect(key, val);
+ }
+ }
+
+
+
Applications may write a custom {@link MapRunnable} to exert greater
+ control on map processing e.g. multi-threaded Mappers etc.
Mapping of input records to output records is complete when this method
+ returns.
+
+ @param input the {@link RecordReader} to read the input records.
+ @param output the {@link OutputCollector} to collect the outputrecords.
+ @param reporter {@link Reporter} to report progress, status-updates etc.
+ @throws IOException]]>
+
+
+
+ Custom implementations of MapRunnable can exert greater
+ control on map processing e.g. multi-threaded, asynchronous mappers etc.
+
+ @see Mapper
+ @deprecated Use {@link org.apache.hadoop.mapreduce.Mapper} instead.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ nearly
+ equal content length.
+ Subclasses implement {@link #getRecordReader(InputSplit, JobConf, Reporter)}
+ to construct RecordReader's for MultiFileSplit's.
+ @see MultiFileSplit
+ @deprecated Use {@link org.apache.hadoop.mapred.lib.CombineFileInputFormat} instead]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ MultiFileSplit can be used to implement {@link RecordReader}'s, with
+ reading one record per file.
+ @see FileSplit
+ @see MultiFileInputFormat
+ @deprecated Use {@link org.apache.hadoop.mapred.lib.CombineFileSplit} instead]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ <key, value> pairs output by {@link Mapper}s
+ and {@link Reducer}s.
+
+
OutputCollector is the generalization of the facility
+ provided by the Map-Reduce framework to collect data output by either the
+ Mapper or the Reducer i.e. intermediate outputs
+ or the output of the job.
The Map-Reduce framework relies on the OutputCommitter of
+ the job to:
+
+
+ Setup the job during initialization. For example, create the temporary
+ output directory for the job during the initialization of the job.
+
+
+ Cleanup the job after the job completion. For example, remove the
+ temporary output directory after the job completion.
+
+
+ Setup the task temporary output.
+
+
+ Check whether a task needs a commit. This is to avoid the commit
+ procedure if a task does not need commit.
+
+
+ Commit of the task output.
+
+
+ Discard the task commit.
+
+
+
+ @see FileOutputCommitter
+ @see JobContext
+ @see TaskAttemptContext
+ @deprecated Use {@link org.apache.hadoop.mapreduce.OutputCommitter} instead.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This is to validate the output specification for the job when it is
+ a job is submitted. Typically checks that it does not already exist,
+ throwing an exception when it already exists, so that output is not
+ overwritten.
+
+ @param ignored
+ @param job job configuration.
+ @throws IOException when output should not be attempted]]>
+
+
+
+ OutputFormat describes the output-specification for a
+ Map-Reduce job.
+
+
The Map-Reduce framework relies on the OutputFormat of the
+ job to:
+
+
+ Validate the output-specification of the job. For e.g. check that the
+ output directory doesn't already exist.
+
+ Provide the {@link RecordWriter} implementation to be used to write out
+ the output files of the job. Output files are stored in a
+ {@link FileSystem}.
+
+
+
+ @see RecordWriter
+ @see JobConf
+ @deprecated Use {@link org.apache.hadoop.mapreduce.OutputFormat} instead.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Typically a hash function on a all or a subset of the key.
+
+ @param key the key to be paritioned.
+ @param value the entry value.
+ @param numPartitions the total number of partitions.
+ @return the partition number for the key.]]>
+
+
+
+ Partitioner controls the partitioning of the keys of the
+ intermediate map-outputs. The key (or a subset of the key) is used to derive
+ the partition, typically by a hash function. The total number of partitions
+ is the same as the number of reduce tasks for the job. Hence this controls
+ which of the m reduce tasks the intermediate key (and hence the
+ record) is sent for reduction.
+
+ @see Reducer
+ @deprecated Use {@link org.apache.hadoop.mapreduce.Partitioner} instead.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if there exists a key/value,
+ false otherwise.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ RawKeyValueIterator is an iterator used to iterate over
+ the raw keys and values during sort/merge of intermediate data.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ 0.0 to 1.0.
+ @throws IOException]]>
+
+
+
+ RecordReader reads <key, value> pairs from an
+ {@link InputSplit}.
+
+
RecordReader, typically, converts the byte-oriented view of
+ the input, provided by the InputSplit, and presents a
+ record-oriented view for the {@link Mapper} & {@link Reducer} tasks for
+ processing. It thus assumes the responsibility of processing record
+ boundaries and presenting the tasks with keys and values.
RecordWriter implementations write the job outputs to the
+ {@link FileSystem}.
+
+ @see OutputFormat]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Reduces values for a given key.
+
+
The framework calls this method for each
+ <key, (list of values)> pair in the grouped inputs.
+ Output values must be of the same type as input values. Input keys must
+ not be altered. The framework will reuse the key and value objects
+ that are passed into the reduce, therefore the application should clone
+ the objects they want to keep a copy of. In many cases, all values are
+ combined into zero or one value.
+
+
+
Output pairs are collected with calls to
+ {@link OutputCollector#collect(Object,Object)}.
+
+
Applications can use the {@link Reporter} provided to report progress
+ or just indicate that they are alive. In scenarios where the application
+ takes an insignificant amount of time to process individual key/value
+ pairs, this is crucial since the framework might assume that the task has
+ timed-out and kill that task. The other way of avoiding this is to set
+
+ mapred.task.timeout to a high-enough value (or even zero for no
+ time-outs).
+
+ @param key the key.
+ @param values the list of values to reduce.
+ @param output to collect keys and combined values.
+ @param reporter facility to report progress.]]>
+
+
+
+ The number of Reducers for the job is set by the user via
+ {@link JobConf#setNumReduceTasks(int)}. Reducer implementations
+ can access the {@link JobConf} for the job via the
+ {@link JobConfigurable#configure(JobConf)} method and initialize themselves.
+ Similarly they can use the {@link Closeable#close()} method for
+ de-initialization.
+
+
Reducer has 3 primary phases:
+
+
+
+
Shuffle
+
+
Reducer is input the grouped output of a {@link Mapper}.
+ In the phase the framework, for each Reducer, fetches the
+ relevant partition of the output of all the Mappers, via HTTP.
+
+
+
+
+
Sort
+
+
The framework groups Reducer inputs by keys
+ (since different Mappers may have output the same key) in this
+ stage.
+
+
The shuffle and sort phases occur simultaneously i.e. while outputs are
+ being fetched they are merged.
+
+
SecondarySort
+
+
If equivalence rules for keys while grouping the intermediates are
+ different from those for grouping keys before reduction, then one may
+ specify a Comparator via
+ {@link JobConf#setOutputValueGroupingComparator(Class)}.Since
+ {@link JobConf#setOutputKeyComparatorClass(Class)} can be used to
+ control how intermediate keys are grouped, these can be used in conjunction
+ to simulate secondary sort on values.
+
+
+ For example, say that you want to find duplicate web pages and tag them
+ all with the url of the "best" known example. You would set up the job
+ like:
+
+
Map Input Key: url
+
Map Input Value: document
+
Map Output Key: document checksum, url pagerank
+
Map Output Value: url
+
Partitioner: by checksum
+
OutputKeyComparator: by checksum and then decreasing pagerank
+
OutputValueGroupingComparator: by checksum
+
+
+
+
+
Reduce
+
+
In this phase the
+ {@link #reduce(Object, Iterator, OutputCollector, Reporter)}
+ method is called for each <key, (list of values)> pair in
+ the grouped inputs.
+
The output of the reduce task is typically written to the
+ {@link FileSystem} via
+ {@link OutputCollector#collect(Object, Object)}.
+
+
+
+
The output of the Reducer is not re-sorted.
+
+
Example:
+
+ public class MyReducer<K extends WritableComparable, V extends Writable>
+ extends MapReduceBase implements Reducer<K, V, K, V> {
+
+ static enum MyCounters { NUM_RECORDS }
+
+ private String reduceTaskId;
+ private int noKeys = 0;
+
+ public void configure(JobConf job) {
+ reduceTaskId = job.get("mapred.task.id");
+ }
+
+ public void reduce(K key, Iterator<V> values,
+ OutputCollector<K, V> output,
+ Reporter reporter)
+ throws IOException {
+
+ // Process
+ int noValues = 0;
+ while (values.hasNext()) {
+ V value = values.next();
+
+ // Increment the no. of values for this key
+ ++noValues;
+
+ // Process the <key, value> pair (assume this takes a while)
+ // ...
+ // ...
+
+ // Let the framework know that we are alive, and kicking!
+ if ((noValues%10) == 0) {
+ reporter.progress();
+ }
+
+ // Process some more
+ // ...
+ // ...
+
+ // Output the <key, value>
+ output.collect(key, value);
+ }
+
+ // Increment the no. of <key, list of values> pairs processed
+ ++noKeys;
+
+ // Increment counters
+ reporter.incrCounter(NUM_RECORDS, 1);
+
+ // Every 100 keys update application-level status
+ if ((noKeys%100) == 0) {
+ reporter.setStatus(reduceTaskId + " processed " + noKeys);
+ }
+ }
+ }
+
+
+ @see Mapper
+ @see Partitioner
+ @see Reporter
+ @see MapReduceBase
+ @deprecated Use {@link org.apache.hadoop.mapreduce.Reducer} instead.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Counter of the given group/name.]]>
+
+
+
+
+
+
+ Counter of the given group/name.]]>
+
+
+
+
+
+
+ Enum.
+ @param amount A non-negative amount by which the counter is to
+ be incremented.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ InputSplit that the map is reading from.
+ @throws UnsupportedOperationException if called outside a mapper]]>
+
+
+
+
+
+
+
+
+ {@link Mapper} and {@link Reducer} can use the Reporter
+ provided to report progress or just indicate that they are alive. In
+ scenarios where the application takes an insignificant amount of time to
+ process individual key/value pairs, this is crucial since the framework
+ might assume that the task has timed-out and kill that task.
+
+
Applications can also update {@link Counters} via the provided
+ Reporter .
+
+ @see Progressable
+ @see Counters]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ progress of the job's map-tasks, as a float between 0.0
+ and 1.0. When all map tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's map-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ progress of the job's reduce-tasks, as a float between 0.0
+ and 1.0. When all reduce tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's reduce-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ progress of the job's cleanup-tasks, as a float between 0.0
+ and 1.0. When all cleanup tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's cleanup-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ progress of the job's setup-tasks, as a float between 0.0
+ and 1.0. When all setup tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's setup-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ true if the job is complete, else false.
+ @throws IOException]]>
+
+
+
+
+
+ true if the job succeeded, else false.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ RunningJob is the user-interface to query for details on a
+ running Map-Reduce job.
+
+
Clients can get hold of RunningJob via the {@link JobClient}
+ and then query the running-job for details such as name, configuration,
+ progress etc.
+
+ @see JobClient]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This allows the user to specify the key class to be different
+ from the actual class ({@link BytesWritable}) used for writing
+
+ @param conf the {@link JobConf} to modify
+ @param theClass the SequenceFile output key class.]]>
+
+
+
+
+
+
+ This allows the user to specify the value class to be different
+ from the actual class ({@link BytesWritable}) used for writing
+
+ @param conf the {@link JobConf} to modify
+ @param theClass the SequenceFile output key class.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ f. The filtering criteria is
+ MD5(key) % f == 0.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ f using
+ the criteria record# % f == 0.
+ For example, if the frequency is 10, one out of 10 records is returned.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if auto increment
+ {@link SkipBadRecords#COUNTER_MAP_PROCESSED_RECORDS}.
+ false otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ true if auto increment
+ {@link SkipBadRecords#COUNTER_REDUCE_PROCESSED_GROUPS}.
+ false otherwise.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Hadoop provides an optional mode of execution in which the bad records
+ are detected and skipped in further attempts.
+
+
This feature can be used when map/reduce tasks crashes deterministically on
+ certain input. This happens due to bugs in the map/reduce function. The usual
+ course would be to fix these bugs. But sometimes this is not possible;
+ perhaps the bug is in third party libraries for which the source code is
+ not available. Due to this, the task never reaches to completion even with
+ multiple attempts and complete data for that task is lost.
+
+
With this feature, only a small portion of data is lost surrounding
+ the bad record, which may be acceptable for some user applications.
+ see {@link SkipBadRecords#setMapperMaxSkipRecords(Configuration, long)}
+
+
The skipping mode gets kicked off after certain no of failures
+ see {@link SkipBadRecords#setAttemptsToStartSkipping(Configuration, int)}
+
+
In the skipping mode, the map/reduce task maintains the record range which
+ is getting processed at all times. Before giving the input to the
+ map/reduce function, it sends this record range to the Task tracker.
+ If task crashes, the Task tracker knows which one was the last reported
+ range. On further attempts that range get skipped.
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ all task attempt IDs
+ of any jobtracker, in any job, of the first
+ map task, we would use :
+
+ @param jtIdentifier jobTracker identifier, or null
+ @param jobId job number, or null
+ @param isMap whether the tip is a map, or null
+ @param taskId taskId number, or null
+ @param attemptId the task attempt number, or null
+ @return a regex pattern matching TaskAttemptIDs]]>
+
+
+
+
+ An example TaskAttemptID is :
+ attempt_200707121733_0003_m_000005_0 , which represents the
+ zeroth task attempt for the fifth map task in the third job
+ running at the jobtracker started at 200707121733.
+
+ Applications should never construct or parse TaskAttemptID strings
+ , but rather use appropriate constructors or {@link #forName(String)}
+ method.
+
+ @see JobID
+ @see TaskID]]>
+
+ @param jtIdentifier jobTracker identifier, or null
+ @param jobId job number, or null
+ @param isMap whether the tip is a map, or null
+ @param taskId taskId number, or null
+ @return a regex pattern matching TaskIDs]]>
+
+
+
+
+
+
+
+
+ An example TaskID is :
+ task_200707121733_0003_m_000005 , which represents the
+ fifth map task in the third job running at the jobtracker
+ started at 200707121733.
+
+ Applications should never construct or parse TaskID strings
+ , but rather use appropriate constructors or {@link #forName(String)}
+ method.
+
+ @see JobID
+ @see TaskAttemptID]]>
+
+
+
+
+
+
+
+ (tbl(,),tbl(,),...,tbl(,)) }]]>
+
+
+
+
+
+
+
+ (tbl(,),tbl(,),...,tbl(,)) }]]>
+
+
+
+ mapred.join.define.<ident> to a classname. In the expression
+ mapred.join.expr, the identifier will be assumed to be a
+ ComposableRecordReader.
+ mapred.join.keycomparator can be a classname used to compare keys
+ in the join.
+ @see JoinRecordReader
+ @see MultiFilterRecordReader]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ......
+ }]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ capacity children to position
+ id in the parent reader.
+ The id of a root CompositeRecordReader is -1 by convention, but relying
+ on this is not recommended.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ override(S1,S2,S3) will prefer values
+ from S3 over S2, and values from S2 over S1 for all keys
+ emitted from all sources.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ [,,...,]]]>
+
+
+
+
+
+
+ out.
+ TupleWritable format:
+ {@code
+ ......
+ }]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ It has to be specified how key and values are passed from one element of
+ the chain to the next, by value or by reference. If a Mapper leverages the
+ assumed semantics that the key and values are not modified by the collector
+ 'by value' must be used. If the Mapper does not expect this semantics, as
+ an optimization to avoid serialization and deserialization 'by reference'
+ can be used.
+
+ For the added Mapper the configuration given for it,
+ mapperConf, have precedence over the job's JobConf. This
+ precedence is in effect when the task is running.
+
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainMapper, this is done by the addMapper for the last mapper in the chain
+
+
+ @param job job's JobConf to add the Mapper class.
+ @param klass the Mapper class to add.
+ @param inputKeyClass mapper input key class.
+ @param inputValueClass mapper input value class.
+ @param outputKeyClass mapper output key class.
+ @param outputValueClass mapper output value class.
+ @param byValue indicates if key/values should be passed by value
+ to the next Mapper in the chain, if any.
+ @param mapperConf a JobConf with the configuration for the Mapper
+ class. It is recommended to use a JobConf without default values using the
+ JobConf(boolean loadDefaults) constructor with FALSE.]]>
+
+
+
+
+
+
+ If this method is overriden super.configure(...) should be
+ invoked at the beginning of the overwriter method.]]>
+
+
+
+
+
+
+
+
+
+ map(...) methods of the Mappers in the chain.]]>
+
+
+
+
+
+
+ If this method is overriden super.close() should be
+ invoked at the end of the overwriter method.]]>
+
+
+
+
+ The Mapper classes are invoked in a chained (or piped) fashion, the output of
+ the first becomes the input of the second, and so on until the last Mapper,
+ the output of the last Mapper will be written to the task's output.
+
+ The key functionality of this feature is that the Mappers in the chain do not
+ need to be aware that they are executed in a chain. This enables having
+ reusable specialized Mappers that can be combined to perform composite
+ operations within a single task.
+
+ Special care has to be taken when creating chains that the key/values output
+ by a Mapper are valid for the following Mapper in the chain. It is assumed
+ all Mappers and the Reduce in the chain use maching output and input key and
+ value classes as no conversion is done by the chaining code.
+
+ Using the ChainMapper and the ChainReducer classes is possible to compose
+ Map/Reduce jobs that look like [MAP+ / REDUCE MAP*]. And
+ immediate benefit of this pattern is a dramatic reduction in disk IO.
+
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainMapper, this is done by the addMapper for the last mapper in the chain.
+
+ ChainMapper usage pattern:
+
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ It has to be specified how key and values are passed from one element of
+ the chain to the next, by value or by reference. If a Reducer leverages the
+ assumed semantics that the key and values are not modified by the collector
+ 'by value' must be used. If the Reducer does not expect this semantics, as
+ an optimization to avoid serialization and deserialization 'by reference'
+ can be used.
+
+ For the added Reducer the configuration given for it,
+ reducerConf, have precedence over the job's JobConf. This
+ precedence is in effect when the task is running.
+
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainReducer, this is done by the setReducer or the addMapper for the last
+ element in the chain.
+
+ @param job job's JobConf to add the Reducer class.
+ @param klass the Reducer class to add.
+ @param inputKeyClass reducer input key class.
+ @param inputValueClass reducer input value class.
+ @param outputKeyClass reducer output key class.
+ @param outputValueClass reducer output value class.
+ @param byValue indicates if key/values should be passed by value
+ to the next Mapper in the chain, if any.
+ @param reducerConf a JobConf with the configuration for the Reducer
+ class. It is recommended to use a JobConf without default values using the
+ JobConf(boolean loadDefaults) constructor with FALSE.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ It has to be specified how key and values are passed from one element of
+ the chain to the next, by value or by reference. If a Mapper leverages the
+ assumed semantics that the key and values are not modified by the collector
+ 'by value' must be used. If the Mapper does not expect this semantics, as
+ an optimization to avoid serialization and deserialization 'by reference'
+ can be used.
+
+ For the added Mapper the configuration given for it,
+ mapperConf, have precedence over the job's JobConf. This
+ precedence is in effect when the task is running.
+
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainMapper, this is done by the addMapper for the last mapper in the chain
+ .
+
+ @param job chain job's JobConf to add the Mapper class.
+ @param klass the Mapper class to add.
+ @param inputKeyClass mapper input key class.
+ @param inputValueClass mapper input value class.
+ @param outputKeyClass mapper output key class.
+ @param outputValueClass mapper output value class.
+ @param byValue indicates if key/values should be passed by value
+ to the next Mapper in the chain, if any.
+ @param mapperConf a JobConf with the configuration for the Mapper
+ class. It is recommended to use a JobConf without default values using the
+ JobConf(boolean loadDefaults) constructor with FALSE.]]>
+
+
+
+
+
+
+ If this method is overriden super.configure(...) should be
+ invoked at the beginning of the overwriter method.]]>
+
+
+
+
+
+
+
+
+
+ reduce(...) method of the Reducer with the
+ map(...) methods of the Mappers in the chain.]]>
+
+
+
+
+
+
+ If this method is overriden super.close() should be
+ invoked at the end of the overwriter method.]]>
+
+
+
+
+ For each record output by the Reducer, the Mapper classes are invoked in a
+ chained (or piped) fashion, the output of the first becomes the input of the
+ second, and so on until the last Mapper, the output of the last Mapper will
+ be written to the task's output.
+
+ The key functionality of this feature is that the Mappers in the chain do not
+ need to be aware that they are executed after the Reducer or in a chain.
+ This enables having reusable specialized Mappers that can be combined to
+ perform composite operations within a single task.
+
+ Special care has to be taken when creating chains that the key/values output
+ by a Mapper are valid for the following Mapper in the chain. It is assumed
+ all Mappers and the Reduce in the chain use maching output and input key and
+ value classes as no conversion is done by the chaining code.
+
+ Using the ChainMapper and the ChainReducer classes is possible to compose
+ Map/Reduce jobs that look like [MAP+ / REDUCE MAP*]. And
+ immediate benefit of this pattern is a dramatic reduction in disk IO.
+
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainReducer, this is done by the setReducer or the addMapper for the last
+ element in the chain.
+
+ ChainReducer usage pattern:
+
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ RecordReader's for CombineFileSplit's.
+ @see CombineFileSplit]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ th Path]]>
+
+
+
+
+
+ th Path]]>
+
+
+
+
+
+
+
+
+
+
+ th Path]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ CombineFileSplit can be used to implement {@link org.apache.hadoop.mapred.RecordReader}'s,
+ with reading one record per file.
+ @see org.apache.hadoop.mapred.FileSplit
+ @see CombineFileInputFormat]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ all splits.
+ @param freq The frequency with which records will be emitted.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ all splits.
+ This will read every split at the client, which is very expensive.
+ @param freq Probability with which a key will be chosen.
+ @param numSamples Total number of samples to obtain from all selected
+ splits.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ all splits.
+ Takes the first numSamples / numSplits records from each split.
+ @param numSamples Total number of samples to obtain from all selected
+ splits.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ true if the name output is multi, false
+ if it is single. If the name output is not defined it returns
+ false]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ @param conf job conf to add the named output
+ @param namedOutput named output name, it has to be a word, letters
+ and numbers only, cannot be the word 'part' as
+ that is reserved for the
+ default output.
+ @param outputFormatClass OutputFormat class.
+ @param keyClass key class
+ @param valueClass value class]]>
+
+
+
+
+
+
+
+
+
+
+
+ @param conf job conf to add the named output
+ @param namedOutput named output name, it has to be a word, letters
+ and numbers only, cannot be the word 'part' as
+ that is reserved for the
+ default output.
+ @param outputFormatClass OutputFormat class.
+ @param keyClass key class
+ @param valueClass value class]]>
+
+
+
+
+
+
+
+ By default these counters are disabled.
+
+ MultipleOutputs supports counters, by default the are disabled.
+ The counters group is the {@link MultipleOutputs} class name.
+
+ The names of the counters are the same as the named outputs. For multi
+ named outputs the name of the counter is the concatenation of the named
+ output, and underscore '_' and the multiname.
+
+ @param conf job conf to enableadd the named output.
+ @param enabled indicates if the counters will be enabled or not.]]>
+
+
+
+
+
+
+ By default these counters are disabled.
+
+ MultipleOutputs supports counters, by default the are disabled.
+ The counters group is the {@link MultipleOutputs} class name.
+
+ The names of the counters are the same as the named outputs. For multi
+ named outputs the name of the counter is the concatenation of the named
+ output, and underscore '_' and the multiname.
+
+
+ @param conf job conf to enableadd the named output.
+ @return TRUE if the counters are enabled, FALSE if they are disabled.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ @param namedOutput the named output name
+ @param reporter the reporter
+ @return the output collector for the given named output
+ @throws IOException thrown if output collector could not be created]]>
+
+
+
+
+
+
+
+
+
+
+ @param namedOutput the named output name
+ @param multiName the multi name part
+ @param reporter the reporter
+ @return the output collector for the given named output
+ @throws IOException thrown if output collector could not be created]]>
+
+
+
+
+
+
+ If overriden subclasses must invoke super.close() at the
+ end of their close()
+
+ @throws java.io.IOException thrown if any of the MultipleOutput files
+ could not be closed properly.]]>
+
+
+
+ OutputCollector passed to
+ the map() and reduce() methods of the
+ Mapper and Reducer implementations.
+
+ Each additional output, or named output, may be configured with its own
+ OutputFormat, with its own key class and with its own value
+ class.
+
+ A named output can be a single file or a multi file. The later is refered as
+ a multi named output.
+
+ A multi named output is an unbound set of files all sharing the same
+ OutputFormat, key class and value class configuration.
+
+ When named outputs are used within a Mapper implementation,
+ key/values written to a name output are not part of the reduce phase, only
+ key/values written to the job OutputCollector are part of the
+ reduce phase.
+
+ MultipleOutputs supports counters, by default the are disabled. The counters
+ group is the {@link MultipleOutputs} class name.
+
+ The names of the counters are the same as the named outputs. For multi
+ named outputs the name of the counter is the concatenation of the named
+ output, and underscore '_' and the multiname.
+
+ Job configuration usage pattern is:
+
+
+ JobConf conf = new JobConf();
+
+ conf.setInputPath(inDir);
+ FileOutputFormat.setOutputPath(conf, outDir);
+
+ conf.setMapperClass(MOMap.class);
+ conf.setReducerClass(MOReduce.class);
+ ...
+
+ // Defines additional single text based output 'text' for the job
+ MultipleOutputs.addNamedOutput(conf, "text", TextOutputFormat.class,
+ LongWritable.class, Text.class);
+
+ // Defines additional multi sequencefile based output 'sequence' for the
+ // job
+ MultipleOutputs.addMultiNamedOutput(conf, "seq",
+ SequenceFileOutputFormat.class,
+ LongWritable.class, Text.class);
+ ...
+
+ JobClient jc = new JobClient();
+ RunningJob job = jc.submitJob(conf);
+
+ ...
+
+
+ Job configuration usage pattern is:
+
+
+ public class MOReduce implements
+ Reducer<WritableComparable, Writable> {
+ private MultipleOutputs mos;
+
+ public void configure(JobConf conf) {
+ ...
+ mos = new MultipleOutputs(conf);
+ }
+
+ public void reduce(WritableComparable key, Iterator<Writable> values,
+ OutputCollector output, Reporter reporter)
+ throws IOException {
+ ...
+ mos.getCollector("text", reporter).collect(key, new Text("Hello"));
+ mos.getCollector("seq", "A", reporter).collect(key, new Text("Bye"));
+ mos.getCollector("seq", "B", reporter).collect(key, new Text("Chau"));
+ ...
+ }
+
+ public void close() throws IOException {
+ mos.close();
+ ...
+ }
+
+ }
+
]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ It can be used instead of the default implementation,
+ @link org.apache.hadoop.mapred.MapRunner, when the Map operation is not CPU
+ bound in order to improve throughput.
+
+ Map implementations using this MapRunnable must be thread-safe.
+
+ The Map-Reduce job has to be configured to use this MapRunnable class (using
+ the JobConf.setMapRunnerClass method) and
+ the number of thread the thread-pool can use with the
+ mapred.map.multithreadedrunner.threads property, its default
+ value is 10 threads.
+
+ Alternatively, the properties can be set in the configuration with proper
+ values.
+
+ @see DBConfiguration#configureDB(JobConf, String, String, String, String)
+ @see DBInputFormat#setInput(JobConf, Class, String, String)
+ @see DBInputFormat#setInput(JobConf, Class, String, String, String, String...)
+ @see DBOutputFormat#setOutput(JobConf, String, String...)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ 20070101 AND length > 0)'
+ @param orderBy the fieldNames in the orderBy clause.
+ @param fieldNames The field names in the table
+ @see #setInput(JobConf, Class, String, String)]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ DBInputFormat emits LongWritables containing the record number as
+ key and DBWritables as value.
+
+ The SQL query, and input class can be using one of the two
+ setInput methods.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ {@link DBOutputFormat} accepts <key,value> pairs, where
+ key has a type extending DBWritable. Returned {@link RecordWriter}
+ writes only the key to the database with a batch SQL query.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ DBWritable. DBWritable, is similar to {@link Writable}
+ except that the {@link #write(PreparedStatement)} method takes a
+ {@link PreparedStatement}, and {@link #readFields(ResultSet)}
+ takes a {@link ResultSet}.
+
+ Implementations are responsible for writing the fields of the object
+ to PreparedStatement, and reading the fields of the object from the
+ ResultSet.
+
+
Example:
+ If we have the following table in the database :
+
+ CREATE TABLE MyTable (
+ counter INTEGER NOT NULL,
+ timestamp BIGINT NOT NULL,
+ );
+
+ then we can read/write the tuples from/to the table with :
+
+ public class MyWritable implements Writable, DBWritable {
+ // Some data
+ private int counter;
+ private long timestamp;
+
+ //Writable#write() implementation
+ public void write(DataOutput out) throws IOException {
+ out.writeInt(counter);
+ out.writeLong(timestamp);
+ }
+
+ //Writable#readFields() implementation
+ public void readFields(DataInput in) throws IOException {
+ counter = in.readInt();
+ timestamp = in.readLong();
+ }
+
+ public void write(PreparedStatement statement) throws SQLException {
+ statement.setInt(1, counter);
+ statement.setLong(2, timestamp);
+ }
+
+ public void readFields(ResultSet resultSet) throws SQLException {
+ counter = resultSet.getInt(1);
+ timestamp = resultSet.getLong(2);
+ }
+ }
+
Note: The split is a logical split of the inputs and the
+ input files are not physically split into chunks. For e.g. a split could
+ be <input-file-path, start, offset> tuple. The InputFormat
+ also creates the {@link RecordReader} to read the {@link InputSplit}.
+
+ @param context job configuration.
+ @return an array of {@link InputSplit}s for the job.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ InputFormat describes the input-specification for a
+ Map-Reduce job.
+
+
The Map-Reduce framework relies on the InputFormat of the
+ job to:
+
+
+ Validate the input-specification of the job.
+
+ Split-up the input file(s) into logical {@link InputSplit}s, each of
+ which is then assigned to an individual {@link Mapper}.
+
+
+ Provide the {@link RecordReader} implementation to be used to glean
+ input records from the logical InputSplit for processing by
+ the {@link Mapper}.
+
+
+
+
The default behavior of file-based {@link InputFormat}s, typically
+ sub-classes of {@link FileInputFormat}, is to split the
+ input into logical {@link InputSplit}s based on the total size, in
+ bytes, of the input files. However, the {@link FileSystem} blocksize of
+ the input files is treated as an upper bound for input splits. A lower bound
+ on the split size can be set via
+
+ mapred.min.split.size.
+
+
Clearly, logical splits based on input-size is insufficient for many
+ applications since record boundaries are to respected. In such cases, the
+ application has to also implement a {@link RecordReader} on whom lies the
+ responsibility to respect record-boundaries and present a record-oriented
+ view of the logical InputSplit to the individual task.
+
+ @see InputSplit
+ @see RecordReader
+ @see FileInputFormat]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ InputSplit represents the data to be processed by an
+ individual {@link Mapper}.
+
+
Typically, it presents a byte-oriented view on the input and is the
+ responsibility of {@link RecordReader} of the job to process this and present
+ a record-oriented view.
+
+ @see InputFormat
+ @see RecordReader]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ InputFormat to use
+ @throws IllegalStateException if the job is submitted]]>
+
+
+
+
+
+
+ OutputFormat to use
+ @throws IllegalStateException if the job is submitted]]>
+
+
+
+
+
+
+ Mapper to use
+ @throws IllegalStateException if the job is submitted]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Reducer to use
+ @throws IllegalStateException if the job is submitted]]>
+
+
+
+
+
+
+ Partitioner to use
+ @throws IllegalStateException if the job is submitted]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ progress of the job's map-tasks, as a float between 0.0
+ and 1.0. When all map tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's map-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ progress of the job's reduce-tasks, as a float between 0.0
+ and 1.0. When all reduce tasks have completed, the function returns 1.0.
+
+ @return the progress of the job's reduce-tasks.
+ @throws IOException]]>
+
+
+
+
+
+ true if the job is complete, else false.
+ @throws IOException]]>
+
+
+
+
+
+ true if the job succeeded, else false.
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ JobTracker is lost]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ 1.
+ @return the number of reduce tasks for this job.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ An example JobID is :
+ job_200707121733_0003 , which represents the third job
+ running at the jobtracker started at 200707121733.
+
+ Applications should never construct or parse JobID strings, but rather
+ use appropriate constructors or {@link #forName(String)} method.
+
+ @see TaskID
+ @see TaskAttemptID
+ @see org.apache.hadoop.mapred.JobTracker#getNewJobId()
+ @see org.apache.hadoop.mapred.JobTracker#getStartTime()]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the key input type to the Mapper
+ @param the value input type to the Mapper
+ @param the key output type from the Mapper
+ @param the value output type from the Mapper]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Maps are the individual tasks which transform input records into a
+ intermediate records. The transformed intermediate records need not be of
+ the same type as the input records. A given input pair may map to zero or
+ many output pairs.
+
+
The Hadoop Map-Reduce framework spawns one map task for each
+ {@link InputSplit} generated by the {@link InputFormat} for the job.
+ Mapper implementations can access the {@link Configuration} for
+ the job via the {@link JobContext#getConfiguration()}.
+
+
The framework first calls
+ {@link #setup(org.apache.hadoop.mapreduce.Mapper.Context)}, followed by
+ {@link #map(Object, Object, Context)}
+ for each key/value pair in the InputSplit. Finally
+ {@link #cleanup(Context)} is called.
+
+
All intermediate values associated with a given output key are
+ subsequently grouped by the framework, and passed to a {@link Reducer} to
+ determine the final output. Users can control the sorting and grouping by
+ specifying two key {@link RawComparator} classes.
+
+
The Mapper outputs are partitioned per
+ Reducer. Users can control which keys (and hence records) go to
+ which Reducer by implementing a custom {@link Partitioner}.
+
+
Users can optionally specify a combiner, via
+ {@link Job#setCombinerClass(Class)}, to perform local aggregation of the
+ intermediate outputs, which helps to cut down the amount of data transferred
+ from the Mapper to the Reducer.
+
+
Applications can specify if and how the intermediate
+ outputs are to be compressed and which {@link CompressionCodec}s are to be
+ used via the Configuration.
+
+
If the job has zero
+ reduces then the output of the Mapper is directly written
+ to the {@link OutputFormat} without sorting by keys.
+
+
Example:
+
+ public class TokenCounterMapper
+ extends Mapper
+
+
Applications may override the {@link #run(Context)} method to exert
+ greater control on map processing e.g. multi-threaded Mappers
+ etc.
The Map-Reduce framework relies on the OutputCommitter of
+ the job to:
+
+
+ Setup the job during initialization. For example, create the temporary
+ output directory for the job during the initialization of the job.
+
+
+ Cleanup the job after the job completion. For example, remove the
+ temporary output directory after the job completion.
+
+
+ Setup the task temporary output.
+
+
+ Check whether a task needs a commit. This is to avoid the commit
+ procedure if a task does not need commit.
+
+
+ Commit of the task output.
+
+
+ Discard the task commit.
+
+
+
+ @see org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
+ @see JobContext
+ @see TaskAttemptContext]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This is to validate the output specification for the job when it is
+ a job is submitted. Typically checks that it does not already exist,
+ throwing an exception when it already exists, so that output is not
+ overwritten.
+
+ @param context information about the job
+ @throws IOException when output should not be attempted]]>
+
+
+
+
+
+
+
+
+
+
+
+ OutputFormat describes the output-specification for a
+ Map-Reduce job.
+
+
The Map-Reduce framework relies on the OutputFormat of the
+ job to:
+
+
+ Validate the output-specification of the job. For e.g. check that the
+ output directory doesn't already exist.
+
+ Provide the {@link RecordWriter} implementation to be used to write out
+ the output files of the job. Output files are stored in a
+ {@link FileSystem}.
+
+
+
+ @see RecordWriter]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ Typically a hash function on a all or a subset of the key.
+
+ @param key the key to be partioned.
+ @param value the entry value.
+ @param numPartitions the total number of partitions.
+ @return the partition number for the key.]]>
+
+
+
+ Partitioner controls the partitioning of the keys of the
+ intermediate map-outputs. The key (or a subset of the key) is used to derive
+ the partition, typically by a hash function. The total number of partitions
+ is the same as the number of reduce tasks for the job. Hence this controls
+ which of the m reduce tasks the intermediate key (and hence the
+ record) is sent for reduction.
+
+ @see Reducer]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ @param ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ RecordWriter to future operations.
+
+ @param context the context of the task
+ @throws IOException]]>
+
+
+
+ RecordWriter writes the output <key, value> pairs
+ to an output file.
+
+
RecordWriter implementations write the job outputs to the
+ {@link FileSystem}.
+
+ @see OutputFormat]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the class of the input keys
+ @param the class of the input values
+ @param the class of the output keys
+ @param the class of the output values]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Reducer implementations
+ can access the {@link Configuration} for the job via the
+ {@link JobContext#getConfiguration()} method.
+
+
Reducer has 3 primary phases:
+
+
+
+
Shuffle
+
+
The Reducer copies the sorted output from each
+ {@link Mapper} using HTTP across the network.
+
+
+
+
Sort
+
+
The framework merge sorts Reducer inputs by
+ keys
+ (since different Mappers may have output the same key).
+
+
The shuffle and sort phases occur simultaneously i.e. while outputs are
+ being fetched they are merged.
+
+
SecondarySort
+
+
To achieve a secondary sort on the values returned by the value
+ iterator, the application should extend the key with the secondary
+ key and define a grouping comparator. The keys will be sorted using the
+ entire key, but will be grouped using the grouping comparator to decide
+ which keys and values are sent in the same call to reduce.The grouping
+ comparator is specified via
+ {@link Job#setGroupingComparatorClass(Class)}. The sort order is
+ controlled by
+ {@link Job#setSortComparatorClass(Class)}.
+
+
+ For example, say that you want to find duplicate web pages and tag them
+ all with the url of the "best" known example. You would set up the job
+ like:
+
+
Map Input Key: url
+
Map Input Value: document
+
Map Output Key: document checksum, url pagerank
+
Map Output Value: url
+
Partitioner: by checksum
+
OutputKeyComparator: by checksum and then decreasing pagerank
+
OutputValueGroupingComparator: by checksum
+
+
+
+
+
Reduce
+
+
In this phase the
+ {@link #reduce(Object, Iterable, Context)}
+ method is called for each <key, (collection of values)> in
+ the sorted inputs.
+
The output of the reduce task is typically written to a
+ {@link RecordWriter} via
+ {@link Context#write(Object, Object)}.
+
+
+
+
The output of the Reducer is not re-sorted.
+
+
Example:
+
+ public class IntSumReducer extends Reducer {
+ private IntWritable result = new IntWritable();
+
+ public void reduce(Key key, Iterable values,
+ Context context) throws IOException {
+ int sum = 0;
+ for (IntWritable val : values) {
+ sum += val.get();
+ }
+ result.set(sum);
+ context.collect(key, result);
+ }
+ }
+
+ Applications should never construct or parse TaskAttemptID strings
+ , but rather use appropriate constructors or {@link #forName(String)}
+ method.
+
+ @see JobID
+ @see TaskID]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ An example TaskID is :
+ task_200707121733_0003_m_000005 , which represents the
+ fifth map task in the third job running at the jobtracker
+ started at 200707121733.
+
+ Applications should never construct or parse TaskID strings
+ , but rather use appropriate constructors or {@link #forName(String)}
+ method.
+
+ @see JobID
+ @see TaskAttemptID]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the input key type for the task
+ @param the input value type for the task
+ @param the output key type for the task
+ @param the output value type for the task]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FileInputFormat implementations can override this and return
+ false to ensure that individual input files are never split-up
+ so that {@link Mapper}s process entire files.
+
+ @param context the job context
+ @param filename the file name to check
+ @return is this file splitable?]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ FileInputFormat is the base class for all file-based
+ InputFormats. This provides a generic implementation of
+ {@link #getSplits(JobContext)}.
+ Subclasses of FileInputFormat can also override the
+ {@link #isSplitable(JobContext, Path)} method to ensure input-files are
+ not split-up and are processed as a whole by {@link Mapper}s.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ the map's input key type
+ @param the map's input value type
+ @param the map's output key type
+ @param the map's output value type
+ @param job the job
+ @return the mapper class to run]]>
+
+
+
+
+
+
+ the map input key type
+ @param the map input value type
+ @param the map output key type
+ @param the map output value type
+ @param job the job to modify
+ @param cls the class to use as the mapper]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ It can be used instead of the default implementation,
+ @link org.apache.hadoop.mapred.MapRunner, when the Map operation is not CPU
+ bound in order to improve throughput.
+
+ Mapper implementations using this MapRunnable must be thread-safe.
+
+ The Map-Reduce job has to be configured with the mapper to use via
+ {@link #setMapperClass(Configuration, Class)} and
+ the number of thread the thread-pool can use with the
+ {@link #getNumberOfThreads(Configuration) method. The default
+ value is 10 threads.
+
Some applications need to create/write-to side-files, which differ from
+ the actual job-outputs.
+
+
In such cases there could be issues with 2 instances of the same TIP
+ (running simultaneously e.g. speculative tasks) trying to open/write-to the
+ same file (path) on HDFS. Hence the application-writer will have to pick
+ unique names per task-attempt (e.g. using the attemptid, say
+ attempt_200709221812_0001_m_000000_0), not just per TIP.
+
+
To get around this the Map-Reduce framework helps the application-writer
+ out by maintaining a special
+ ${mapred.output.dir}/_temporary/_${taskid}
+ sub-directory for each task-attempt on HDFS where the output of the
+ task-attempt goes. On successful completion of the task-attempt the files
+ in the ${mapred.output.dir}/_temporary/_${taskid} (only)
+ are promoted to ${mapred.output.dir}. Of course, the
+ framework discards the sub-directory of unsuccessful task-attempts. This
+ is completely transparent to the application.
+
+
The application-writer can take advantage of this by creating any
+ side-files required in a work directory during execution
+ of his task i.e. via
+ {@link #getWorkOutputPath(TaskInputOutputContext)}, and
+ the framework will move them out similarly - thus she doesn't have to pick
+ unique paths per task-attempt.
+
+
The entire discussion holds true for maps of jobs with
+ reducer=NONE (i.e. 0 reduces) since output of the map, in that case,
+ goes directly to HDFS.
+
+ @return the {@link Path} to the task's temporary output directory
+ for the map-reduce job.]]>
+
+
+
+
+
+
+
+
+
+ The path can be used to create custom files from within the map and
+ reduce tasks. The path name will be unique for each task. The path parent
+ will be the job output directory.ls
+
+
This method uses the {@link #getUniqueFile} method to make the file name
+ unique for the task.
The most important difference is that unlike GFS, Hadoop DFS files
+have strictly one writer at any one time. Bytes are always appended
+to the end of the writer's stream. There is no notion of "record appends"
+or "mutations" that are then checked or reordered. Writers simply emit
+a byte stream. That byte stream is guaranteed to be stored in the
+order written.
+ The client will then have to contact
+ one of the indicated DataNodes to obtain the actual data.
+
+ @param src file name
+ @param offset range start offset
+ @param length range length
+ @return file length and array of blocks with their locations
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ This will create an empty file specified by the source path.
+ The path should reflect a full path originated at the root.
+ The name-node does not have a notion of "current" directory for a client.
+
+ Once created, the file is visible and available for read to other clients.
+ Although, other clients cannot {@link #delete(String)}, re-create or
+ {@link #rename(String, String)} it until the file is completed
+ or explicitly as a result of lease expiration.
+
+ Blocks have a maximum size. Clients that intend to
+ create multi-block files must also use {@link #addBlock(String, String)}.
+
+ @param src path of the file being created.
+ @param masked masked permission.
+ @param clientName name of the current client.
+ @param overwrite indicates whether the file should be
+ overwritten if it already exists.
+ @param replication block replication factor.
+ @param blockSize maximum block size.
+
+ @throws AccessControlException if permission to create file is
+ denied by the system. As usually on the client side the exception will
+ be wrapped into {@link org.apache.hadoop.ipc.RemoteException}.
+ @throws QuotaExceededException if the file creation violates
+ any quota restriction
+ @throws IOException if other errors occur.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The NameNode sets replication to the new value and returns.
+ The actual block replication is not expected to be performed during
+ this method call. The blocks will be populated or removed in the
+ background as the result of the routine block maintenance procedures.
+
+ @param src file name
+ @param replication new replication
+ @throws IOException
+ @return true if successful;
+ false if file does not exist or is a directory]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Any blocks belonging to the deleted files will be garbage-collected.
+
+ @param src existing name.
+ @return true only if the existing file or directory was actually removed
+ from the file system.]]>
+
+
+
+
+
+
+
+
+ same as delete but provides a way to avoid accidentally
+ deleting non empty directories programmatically.
+ @param src existing name
+ @param recursive if true deletes a non empty directory recursively,
+ else throws an exception.
+ @return true only if the existing file or directory was actually removed
+ from the file system.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ So, the NameNode will revoke the locks and live file-creates
+ for clients that it thinks have died. A client tells the
+ NameNode that it is still alive by periodically calling
+ renewLease(). If a certain amount of time passes since
+ the last call to renewLease(), the NameNode assumes the
+ client has died.]]>
+
+
+
+
+
+
+
[0] contains the total storage capacity of the system, in bytes.
+
[1] contains the total used space of the system, in bytes.
+
[2] contains the available storage of the system, in bytes.
+
[3] contains number of under replicated blocks in the system.
+
[4] contains number of blocks with a corrupt replica.
+
[5] contains number of blocks without any good replicas left.
+
+ Use public constants like {@link #GET_STATS_CAPACITY_IDX} in place of
+ actual numbers to index into the array.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Safe mode is a name node state when it
+
does not accept changes to name space (read-only), and
+
does not replicate or delete blocks.
+
+
+ Safe mode is entered automatically at name node startup.
+ Safe mode can also be entered manually using
+ {@link #setSafeMode(FSConstants.SafeModeAction) setSafeMode(SafeModeAction.SAFEMODE_GET)}.
+
+ At startup the name node accepts data node reports collecting
+ information about block locations.
+ In order to leave safe mode it needs to collect a configurable
+ percentage called threshold of blocks, which satisfy the minimal
+ replication condition.
+ The minimal replication condition is that each block must have at least
+ dfs.replication.min replicas.
+ When the threshold is reached the name node extends safe mode
+ for a configurable amount of time
+ to let the remaining data nodes to check in before it
+ will start replicating missing blocks.
+ Then the name node leaves safe mode.
+
+ If safe mode is turned on manually using
+ {@link #setSafeMode(FSConstants.SafeModeAction) setSafeMode(SafeModeAction.SAFEMODE_ENTER)}
+ then the name node stays in safe mode until it is manually turned off
+ using {@link #setSafeMode(FSConstants.SafeModeAction) setSafeMode(SafeModeAction.SAFEMODE_LEAVE)}.
+ Current state of the name node can be verified using
+ {@link #setSafeMode(FSConstants.SafeModeAction) setSafeMode(SafeModeAction.SAFEMODE_GET)}
+
Configuration parameters:
+ dfs.safemode.threshold.pct is the threshold parameter.
+ dfs.safemode.extension is the safe mode extension parameter.
+ dfs.replication.min is the minimal replication parameter.
+
+
Special cases:
+ The name node does not enter safe mode at startup if the threshold is
+ set to 0 or if the name space is empty.
+ If the threshold is set to 1 then all blocks need to have at least
+ minimal replication.
+ If the threshold value is greater than 1 then the name node will not be
+ able to turn off safe mode automatically.
+ Safe mode can always be turned off manually.
+
+ @param action
+ To start:
+ bin/start-balancer.sh [-threshold ]
+ Example: bin/ start-balancer.sh
+ start the balancer with a default threshold of 10%
+ bin/ start-balancer.sh -threshold 5
+ start the balancer with a threshold of 5%
+ To stop:
+ bin/ stop-balancer.sh
+
+
+
DESCRIPTION
+
The threshold parameter is a fraction in the range of (0%, 100%) with a
+ default value of 10%. The threshold sets a target for whether the cluster
+ is balanced. A cluster is balanced if for each datanode, the utilization
+ of the node (ratio of used space at the node to total capacity of the node)
+ differs from the utilization of the (ratio of used space in the cluster
+ to total capacity of the cluster) by no more than the threshold value.
+ The smaller the threshold, the more balanced a cluster will become.
+ It takes more time to run the balancer for small threshold values.
+ Also for a very small threshold the cluster may not be able to reach the
+ balanced state when applications write and delete files concurrently.
+
+
The tool moves blocks from highly utilized datanodes to poorly
+ utilized datanodes iteratively. In each iteration a datanode moves or
+ receives no more than the lesser of 10G bytes or the threshold fraction
+ of its capacity. Each iteration runs no more than 20 minutes.
+ At the end of each iteration, the balancer obtains updated datanodes
+ information from the namenode.
+
+
A system property that limits the balancer's use of bandwidth is
+ defined in the default configuration file:
+
+
+ dfs.balance.bandwidthPerSec
+ 1048576
+ Specifies the maximum bandwidth that each datanode
+ can utilize for the balancing purpose in term of the number of bytes
+ per second.
+
+
+
+
This property determines the maximum speed at which a block will be
+ moved from one datanode to another. The default value is 1MB/s. The higher
+ the bandwidth, the faster a cluster can reach the balanced state,
+ but with greater competition with application processes. If an
+ administrator changes the value of this property in the configuration
+ file, the change is observed when HDFS is next restarted.
+
+
MONITERING BALANCER PROGRESS
+
After the balancer is started, an output file name where the balancer
+ progress will be recorded is printed on the screen. The administrator
+ can monitor the running of the balancer by reading the output file.
+ The output shows the balancer's status iteration by iteration. In each
+ iteration it prints the starting time, the iteration number, the total
+ number of bytes that have been moved in the previous iterations,
+ the total number of bytes that are left to move in order for the cluster
+ to be balanced, and the number of bytes that are being moved in this
+ iteration. Normally "Bytes Already Moved" is increasing while "Bytes Left
+ To Move" is decreasing.
+
+
Running multiple instances of the balancer in an HDFS cluster is
+ prohibited by the tool.
+
+
The balancer automatically exits when any of the following five
+ conditions is satisfied:
+
+
The cluster is balanced;
+
No block can be moved;
+
No block has been moved for five consecutive iterations;
+
An IOException occurs while communicating with the namenode;
+
Another balancer is running.
+
+
+
Upon exit, a balancer returns an exit code and prints one of the
+ following messages to the output file in corresponding to the above exit
+ reasons:
+
+
The cluster is balanced. Exiting
+
No block can be moved. Exiting...
+
No block has been moved for 3 iterations. Exiting...
+
Received an IO exception: failure reason. Exiting...
+
Another balancer is running. Exiting...
+
+
+
The administrator can interrupt the execution of the balancer at any
+ time by running the command "stop-balancer.sh" on the machine where the
+ balancer is running.]]>
+
+ Local storage can reside in multiple directories.
+ Each directory should contain the same VERSION file as the others.
+ During startup Hadoop servers (name-node and data-nodes) read their local
+ storage information from them.
+
+ The servers hold a lock for each storage directory while they run so that
+ other nodes were not able to startup sharing the same storage.
+ The locks are released when the servers stop (normally or abnormally).]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Removes contents of the current directory and creates an empty directory.
+
+ This does not fully format storage directory.
+ It cannot write the version file since it should be written last after
+ all other storage type dependent files are written.
+ Derived storage is responsible for setting specific storage values and
+ writing the version file to disk.
+
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Locking is not supported by all file systems.
+ E.g., NFS does not consistently support exclusive locks.
+
+
If locking is supported we guarantee exculsive access to the
+ storage directory. Otherwise, no guarantee is given.
+
+ @throws IOException if locking fails]]>
+
+ For the metrics that are sampled and averaged, one must specify
+ a metrics context that does periodic update calls. Most metrics contexts do.
+ The default Null metrics context however does NOT. So if you aren't
+ using any other metrics context then you can turn on the viewing and averaging
+ of sampled metrics by specifying the following two lines
+ in the hadoop-meterics.properties file:
+
+ Note that the metrics are collected regardless of the context used.
+ The context with the update thread is used to average the data periodically
+
+
+
+ Impl details: We use a dynamic mbean that gets the list of the metrics
+ from the metrics registry passed as an argument to the constructor]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class has a number of metrics variables that are publicly accessible;
+ these variables (objects) have methods to update their values;
+ for example:
+
+ Finally, the namenode returns its namespaceID as the registrationID
+ for the datanodes.
+ namespaceID is a persistent attribute of the name space.
+ The registrationID is checked every time the datanode is communicating
+ with the namenode.
+ Datanodes with inappropriate registrationID are rejected.
+ If the namenode stops, and then restarts it can restore its
+ namespaceID and will continue serving the datanodes that has previously
+ registered with the namenode without restarting the whole cluster.
+
+ @see org.apache.hadoop.hdfs.server.datanode.DataNode#register()]]>
+
{@link StartupOption#REGULAR REGULAR} - normal name node startup
+
{@link StartupOption#FORMAT FORMAT} - format name node
+
{@link StartupOption#UPGRADE UPGRADE} - start the cluster
+ upgrade and create a snapshot of the current file system state
+
{@link StartupOption#ROLLBACK ROLLBACK} - roll the
+ cluster back to the previous state
+
+ The option is passed via configuration field:
+ dfs.namenode.startup
+
+ The conf will be modified to reflect the actual ports on which
+ the NameNode is up and running if the user passes the port as
+ zero in the conf.
+
+ @param conf confirguration
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ datanode whose
+ total size is size
+
+ @param datanode on which blocks are located
+ @param size total size of blocks]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ blocksequence (namespace)
+ 2) block->machinelist ("inodes")
+
+ The first table is stored on disk and is very precious.
+ The second table is rebuilt every time the NameNode comes
+ up.
+
+ 'NameNode' refers to both this class as well as the 'NameNode server'.
+ The 'FSNamesystem' class actually performs most of the filesystem
+ management. The majority of the 'NameNode' class itself is concerned
+ with exposing the IPC interface and the http server to the outside world,
+ plus some configuration management.
+
+ NameNode implements the ClientProtocol interface, which allows
+ clients to ask for DFS services. ClientProtocol is not
+ designed for direct use by authors of DFS client code. End-users
+ should instead use the org.apache.nutch.hadoop.fs.FileSystem class.
+
+ NameNode also implements the DatanodeProtocol interface, used by
+ DataNode programs that actually store DFS data blocks. These
+ methods are invoked repeatedly and automatically by all the
+ DataNodes in a DFS deployment.
+
+ NameNode also implements the NamenodeProtocol interface, used by
+ secondary namenodes or rebalancing processes to get partial namenode's
+ state, for example partial blocksMap etc.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The tool scans all files and directories, starting from an indicated
+ root path. The following abnormal conditions are detected and handled:
+
+
files with blocks that are completely missing from all datanodes.
+ In this case the tool can perform one of the following actions:
+
+
none ({@link #FIXING_NONE})
+
move corrupted files to /lost+found directory on DFS
+ ({@link #FIXING_MOVE}). Remaining data blocks are saved as a
+ block chains, representing longest consecutive series of valid blocks.
+
delete corrupted files ({@link #FIXING_DELETE})
+
+
+
detect files with under-replicated or over-replicated blocks
+
+
+
+
+
+
+
+
+
+
+ For the metrics that are sampled and averaged, one must specify
+ a metrics context that does periodic update calls. Most metrics contexts do.
+ The default Null metrics context however does NOT. So if you aren't
+ using any other metrics context then you can turn on the viewing and averaging
+ of sampled metrics by specifying the following two lines
+ in the hadoop-meterics.properties file:
+
+ Note that the metrics are collected regardless of the context used.
+ The context with the update thread is used to average the data periodically
+
+
+
+ Impl details: We use a dynamic mbean that gets the list of the metrics
+ from the metrics registry passed as an argument to the constructor]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class has a number of metrics variables that are publicly accessible;
+ these variables (objects) have methods to update their values;
+ for example:
+
move corrupted files to /lost+found directory on DFS
+ ({@link org.apache.hadoop.hdfs.server.namenode.NamenodeFsck#FIXING_MOVE}). Remaining data blocks are saved as a
+ block chains, representing longest consecutive series of valid blocks.
The most important difference is that unlike GFS, Hadoop DFS files
+have strictly one writer at any one time. Bytes are always appended
+to the end of the writer's stream. There is no notion of "record appends"
+or "mutations" that are then checked or reordered. Writers simply emit
+a byte stream. That byte stream is guaranteed to be stored in the
+order written.
+ The client will then have to contact
+ one of the indicated DataNodes to obtain the actual data.
+
+ @param src file name
+ @param offset range start offset
+ @param length range length
+ @return file length and array of blocks with their locations
+ @throws IOException
+ @throws UnresolvedLinkException if the path contains a symlink.
+ @throws FileNotFoundException if the path does not exist.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This will create an empty file specified by the source path.
+ The path should reflect a full path originated at the root.
+ The name-node does not have a notion of "current" directory for a client.
+
+ Once created, the file is visible and available for read to other clients.
+ Although, other clients cannot {@link #delete(String, boolean)}, re-create or
+ {@link #rename(String, String)} it until the file is completed
+ or explicitly as a result of lease expiration.
+
+ Blocks have a maximum size. Clients that intend to create
+ multi-block files must also use {@link #addBlock(String, String, Block, DatanodeInfo[])}.
+
+ @param src path of the file being created.
+ @param masked masked permission.
+ @param clientName name of the current client.
+ @param flag indicates whether the file should be
+ overwritten if it already exists or create if it does not exist or append.
+ @param createParent create missing parent directory if true
+ @param replication block replication factor.
+ @param blockSize maximum block size.
+
+ @throws AccessControlException if permission to create file is
+ denied by the system. As usually on the client side the exception will
+ be wrapped into {@link org.apache.hadoop.ipc.RemoteException}.
+ @throws QuotaExceededException if the file creation violates
+ any quota restriction
+ @throws IOException if other errors occur.
+ @throws UnresolvedLinkException if the path contains a symlink.
+ @throws AlreadyBeingCreatedException if the path does not exist.
+ @throws NSQuotaExceededException if the namespace quota is exceeded.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The NameNode sets replication to the new value and returns.
+ The actual block replication is not expected to be performed during
+ this method call. The blocks will be populated or removed in the
+ background as the result of the routine block maintenance procedures.
+
+ @param src file name
+ @param replication new replication
+ @throws IOException
+ @return true if successful;
+ false if file does not exist or is a directory
+ @throws UnresolvedLinkException if the path contains a symlink.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Fails if src is a file and dst is a directory.
+
Fails if src is a directory and dst is a file.
+
Fails if the parent of dst does not exist or is a file.
+
+
+ Without OVERWRITE option, rename fails if the dst already exists.
+ With OVERWRITE option, rename overwrites the dst, if it is a file
+ or an empty directory. Rename fails if dst is a non-empty directory.
+
+ This implementation of rename is atomic.
+
+ @param src existing file or directory name.
+ @param dst new name.
+ @param options Rename options
+ @throws IOException if rename failed
+ @throws UnresolvedLinkException if the path contains a symlink.]]>
+
+
+
+
+
+
+
+
+ Any blocks belonging to the deleted files will be garbage-collected.
+
+ @param src existing name.
+ @return true only if the existing file or directory was actually removed
+ from the file system.
+ @throws UnresolvedLinkException if the path contains a symlink.
+ @deprecated use {@link #delete(String, boolean)} istead.]]>
+
+
+
+
+
+
+
+
+
+ same as delete but provides a way to avoid accidentally
+ deleting non empty directories programmatically.
+ @param src existing name
+ @param recursive if true deletes a non empty directory recursively,
+ else throws an exception.
+ @return true only if the existing file or directory was actually removed
+ from the file system.
+ @throws UnresolvedLinkException if the path contains a symlink.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ So, the NameNode will revoke the locks and live file-creates
+ for clients that it thinks have died. A client tells the
+ NameNode that it is still alive by periodically calling
+ renewLease(). If a certain amount of time passes since
+ the last call to renewLease(), the NameNode assumes the
+ client has died.
+ @throws UnresolvedLinkException if the path contains a symlink.]]>
+
+
+
+
+
+
+
[0] contains the total storage capacity of the system, in bytes.
+
[1] contains the total used space of the system, in bytes.
+
[2] contains the available storage of the system, in bytes.
+
[3] contains number of under replicated blocks in the system.
+
[4] contains number of blocks with a corrupt replica.
+
[5] contains number of blocks without any good replicas left.
+
+ Use public constants like {@link #GET_STATS_CAPACITY_IDX} in place of
+ actual numbers to index into the array.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Safe mode is a name node state when it
+
does not accept changes to name space (read-only), and
+
does not replicate or delete blocks.
+
+
+ Safe mode is entered automatically at name node startup.
+ Safe mode can also be entered manually using
+ {@link #setSafeMode(FSConstants.SafeModeAction) setSafeMode(SafeModeAction.SAFEMODE_GET)}.
+
+ At startup the name node accepts data node reports collecting
+ information about block locations.
+ In order to leave safe mode it needs to collect a configurable
+ percentage called threshold of blocks, which satisfy the minimal
+ replication condition.
+ The minimal replication condition is that each block must have at least
+ dfs.namenode.replication.min replicas.
+ When the threshold is reached the name node extends safe mode
+ for a configurable amount of time
+ to let the remaining data nodes to check in before it
+ will start replicating missing blocks.
+ Then the name node leaves safe mode.
+
+ If safe mode is turned on manually using
+ {@link #setSafeMode(FSConstants.SafeModeAction) setSafeMode(SafeModeAction.SAFEMODE_ENTER)}
+ then the name node stays in safe mode until it is manually turned off
+ using {@link #setSafeMode(FSConstants.SafeModeAction) setSafeMode(SafeModeAction.SAFEMODE_LEAVE)}.
+ Current state of the name node can be verified using
+ {@link #setSafeMode(FSConstants.SafeModeAction) setSafeMode(SafeModeAction.SAFEMODE_GET)}
+
Configuration parameters:
+ dfs.safemode.threshold.pct is the threshold parameter.
+ dfs.safemode.extension is the safe mode extension parameter.
+ dfs.namenode.replication.min is the minimal replication parameter.
+
+
Special cases:
+ The name node does not enter safe mode at startup if the threshold is
+ set to 0 or if the name space is empty.
+ If the threshold is set to 1 then all blocks need to have at least
+ minimal replication.
+ If the threshold value is greater than 1 then the name node will not be
+ able to turn off safe mode automatically.
+ Safe mode can always be turned off manually.
+
+ @param action
+ To start:
+ bin/start-balancer.sh [-threshold ]
+ Example: bin/ start-balancer.sh
+ start the balancer with a default threshold of 10%
+ bin/ start-balancer.sh -threshold 5
+ start the balancer with a threshold of 5%
+ To stop:
+ bin/ stop-balancer.sh
+
+
+
DESCRIPTION
+
The threshold parameter is a fraction in the range of (0%, 100%) with a
+ default value of 10%. The threshold sets a target for whether the cluster
+ is balanced. A cluster is balanced if for each datanode, the utilization
+ of the node (ratio of used space at the node to total capacity of the node)
+ differs from the utilization of the (ratio of used space in the cluster
+ to total capacity of the cluster) by no more than the threshold value.
+ The smaller the threshold, the more balanced a cluster will become.
+ It takes more time to run the balancer for small threshold values.
+ Also for a very small threshold the cluster may not be able to reach the
+ balanced state when applications write and delete files concurrently.
+
+
The tool moves blocks from highly utilized datanodes to poorly
+ utilized datanodes iteratively. In each iteration a datanode moves or
+ receives no more than the lesser of 10G bytes or the threshold fraction
+ of its capacity. Each iteration runs no more than 20 minutes.
+ At the end of each iteration, the balancer obtains updated datanodes
+ information from the namenode.
+
+
A system property that limits the balancer's use of bandwidth is
+ defined in the default configuration file:
+
+
+ dfs.balance.bandwidthPerSec
+ 1048576
+ Specifies the maximum bandwidth that each datanode
+ can utilize for the balancing purpose in term of the number of bytes
+ per second.
+
+
+
+
This property determines the maximum speed at which a block will be
+ moved from one datanode to another. The default value is 1MB/s. The higher
+ the bandwidth, the faster a cluster can reach the balanced state,
+ but with greater competition with application processes. If an
+ administrator changes the value of this property in the configuration
+ file, the change is observed when HDFS is next restarted.
+
+
MONITERING BALANCER PROGRESS
+
After the balancer is started, an output file name where the balancer
+ progress will be recorded is printed on the screen. The administrator
+ can monitor the running of the balancer by reading the output file.
+ The output shows the balancer's status iteration by iteration. In each
+ iteration it prints the starting time, the iteration number, the total
+ number of bytes that have been moved in the previous iterations,
+ the total number of bytes that are left to move in order for the cluster
+ to be balanced, and the number of bytes that are being moved in this
+ iteration. Normally "Bytes Already Moved" is increasing while "Bytes Left
+ To Move" is decreasing.
+
+
Running multiple instances of the balancer in an HDFS cluster is
+ prohibited by the tool.
+
+
The balancer automatically exits when any of the following five
+ conditions is satisfied:
+
+
The cluster is balanced;
+
No block can be moved;
+
No block has been moved for five consecutive iterations;
+
An IOException occurs while communicating with the namenode;
+
Another balancer is running.
+
+
+
Upon exit, a balancer returns an exit code and prints one of the
+ following messages to the output file in corresponding to the above exit
+ reasons:
+
+
The cluster is balanced. Exiting
+
No block can be moved. Exiting...
+
No block has been moved for 3 iterations. Exiting...
+
Received an IO exception: failure reason. Exiting...
+
Another balancer is running. Exiting...
+
+
+
The administrator can interrupt the execution of the balancer at any
+ time by running the command "stop-balancer.sh" on the machine where the
+ balancer is running.]]>
+
+ Local storage can reside in multiple directories.
+ Each directory should contain the same VERSION file as the others.
+ During startup Hadoop servers (name-node and data-nodes) read their local
+ storage information from them.
+
+ The servers hold a lock for each storage directory while they run so that
+ other nodes were not able to startup sharing the same storage.
+ The locks are released when the servers stop (normally or abnormally).]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Removes contents of the current directory and creates an empty directory.
+
+ This does not fully format storage directory.
+ It cannot write the version file since it should be written last after
+ all other storage type dependent files are written.
+ Derived storage is responsible for setting specific storage values and
+ writing the version file to disk.
+
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
node type
+
layout version
+
namespaceID
+
fs state creation time
+
other fields specific for this node type
+
+ The version file is always written last during storage directory updates.
+ The existence of the version file indicates that all other files have
+ been successfully written in the storage directory, the storage is valid
+ and does not need to be recovered.
+
+ @return the version file path]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Locking is not supported by all file systems.
+ E.g., NFS does not consistently support exclusive locks.
+
+
If locking is supported we guarantee exculsive access to the
+ storage directory. Otherwise, no guarantee is given.
+
+ @throws IOException if locking fails]]>
+
+ For the metrics that are sampled and averaged, one must specify
+ a metrics context that does periodic update calls. Most metrics contexts do.
+ The default Null metrics context however does NOT. So if you aren't
+ using any other metrics context then you can turn on the viewing and averaging
+ of sampled metrics by specifying the following two lines
+ in the hadoop-meterics.properties file:
+
+ Note that the metrics are collected regardless of the context used.
+ The context with the update thread is used to average the data periodically
+
+
+
+ Impl details: We use a dynamic mbean that gets the list of the metrics
+ from the metrics registry passed as an argument to the constructor]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class has a number of metrics variables that are publicly accessible;
+ these variables (objects) have methods to update their values;
+ for example:
+
{@link NamenodeRole#CHECKPOINT} node periodically creates checkpoints,
+ that is downloads image and edits from the active node, merges them, and
+ uploads the new image back to the active.
+
{@link NamenodeRole#BACKUP} node keeps its namespace in sync with the
+ active node, and periodically creates checkpoints by simply saving the
+ namespace image to local disk(s).
+ Finally, the namenode returns its namespaceID as the registrationID
+ for the datanodes.
+ namespaceID is a persistent attribute of the name space.
+ The registrationID is checked every time the datanode is communicating
+ with the namenode.
+ Datanodes with inappropriate registrationID are rejected.
+ If the namenode stops, and then restarts it can restore its
+ namespaceID and will continue serving the datanodes that has previously
+ registered with the namenode without restarting the whole cluster.
+
+ @see org.apache.hadoop.hdfs.server.datanode.DataNode#register()]]>
+
+ The option is passed via configuration field:
+ dfs.namenode.startup
+
+ The conf will be modified to reflect the actual ports on which
+ the NameNode is up and running if the user passes the port as
+ zero in the conf.
+
+ @param conf confirguration
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ blocksequence (namespace)
+ 2) block->machinelist ("inodes")
+
+ The first table is stored on disk and is very precious.
+ The second table is rebuilt every time the NameNode comes
+ up.
+
+ 'NameNode' refers to both this class as well as the 'NameNode server'.
+ The 'FSNamesystem' class actually performs most of the filesystem
+ management. The majority of the 'NameNode' class itself is concerned
+ with exposing the IPC interface and the http server to the outside world,
+ plus some configuration management.
+
+ NameNode implements the ClientProtocol interface, which allows
+ clients to ask for DFS services. ClientProtocol is not
+ designed for direct use by authors of DFS client code. End-users
+ should instead use the org.apache.nutch.hadoop.fs.FileSystem class.
+
+ NameNode also implements the DatanodeProtocol interface, used by
+ DataNode programs that actually store DFS data blocks. These
+ methods are invoked repeatedly and automatically by all the
+ DataNodes in a DFS deployment.
+
+ NameNode also implements the NamenodeProtocol interface, used by
+ secondary namenodes or rebalancing processes to get partial namenode's
+ state, for example partial blocksMap etc.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The tool scans all files and directories, starting from an indicated
+ root path. The following abnormal conditions are detected and handled:
+
+
files with blocks that are completely missing from all datanodes.
+ In this case the tool can perform one of the following actions:
+
+
none ({@link #FIXING_NONE})
+
move corrupted files to /lost+found directory on DFS
+ ({@link #FIXING_MOVE}). Remaining data blocks are saved as a
+ block chains, representing longest consecutive series of valid blocks.
+
delete corrupted files ({@link #FIXING_DELETE})
+
+
+
detect files with under-replicated or over-replicated blocks
+
+
+
+
+
+
+
+
+
+
+ For the metrics that are sampled and averaged, one must specify
+ a metrics context that does periodic update calls. Most metrics contexts do.
+ The default Null metrics context however does NOT. So if you aren't
+ using any other metrics context then you can turn on the viewing and averaging
+ of sampled metrics by specifying the following two lines
+ in the hadoop-meterics.properties file:
+
+ Note that the metrics are collected regardless of the context used.
+ The context with the update thread is used to average the data periodically
+
+
+
+ Impl details: We use a dynamic mbean that gets the list of the metrics
+ from the metrics registry passed as an argument to the constructor]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class has a number of metrics variables that are publicly accessible;
+ these variables (objects) have methods to update their values;
+ for example:
+
move corrupted files to /lost+found directory on DFS
+ ({@link org.apache.hadoop.hdfs.server.namenode.NamenodeFsck#FIXING_MOVE}). Remaining data blocks are saved as a
+ block chains, representing longest consecutive series of valid blocks.
The most important difference is that unlike GFS, Hadoop DFS files
+have strictly one writer at any one time. Bytes are always appended
+to the end of the writer's stream. There is no notion of "record appends"
+or "mutations" that are then checked or reordered. Writers simply emit
+a byte stream. That byte stream is guaranteed to be stored in the
+order written.
+ The client will then have to contact
+ one of the indicated DataNodes to obtain the actual data.
+
+ @param src file name
+ @param offset range start offset
+ @param length range length
+
+ @return file length and array of blocks with their locations
+
+ @throws AccessControlException If access is denied
+ @throws FileNotFoundException If file src does not exist
+ @throws UnresolvedLinkException If src contains a symlink
+ @throws IOException If an I/O error occurred]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This will create an empty file specified by the source path.
+ The path should reflect a full path originated at the root.
+ The name-node does not have a notion of "current" directory for a client.
+
+ Once created, the file is visible and available for read to other clients.
+ Although, other clients cannot {@link #delete(String, boolean)}, re-create or
+ {@link #rename(String, String)} it until the file is completed
+ or explicitly as a result of lease expiration.
+
+ Blocks have a maximum size. Clients that intend to create
+ multi-block files must also use
+ {@link #addBlock(String, String, Block, DatanodeInfo[])}
+
+ @param src path of the file being created.
+ @param masked masked permission.
+ @param clientName name of the current client.
+ @param flag indicates whether the file should be
+ overwritten if it already exists or create if it does not exist or append.
+ @param createParent create missing parent directory if true
+ @param replication block replication factor.
+ @param blockSize maximum block size.
+
+ @throws AccessControlException If access is denied
+ @throws AlreadyBeingCreatedException if the path does not exist.
+ @throws DSQuotaExceededException If file creation violates disk space
+ quota restriction
+ @throws FileAlreadyExistsException If file src already exists
+ @throws FileNotFoundException If parent of src does not exist
+ and createParent is false
+ @throws ParentNotDirectoryException If parent of src is not a
+ directory.
+ @throws NSQuotaExceededException If file creation violates name space
+ quota restriction
+ @throws SafeModeException create not allowed in safemode
+ @throws UnresolvedLinkException If src contains a symlink
+ @throws IOException If an I/O error occurred
+
+ RuntimeExceptions:
+ @throws InvalidPathException Path src is invalid]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ src is not found
+ @throws DSQuotaExceededException If append violates disk space quota
+ restriction
+ @throws SafeModeException append not allowed in safemode
+ @throws UnresolvedLinkException If src contains a symlink
+ @throws IOException If an I/O error occurred.
+
+ RuntimeExceptions:
+ @throws UnsupportedOperationException if append is not supported]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The NameNode sets replication to the new value and returns.
+ The actual block replication is not expected to be performed during
+ this method call. The blocks will be populated or removed in the
+ background as the result of the routine block maintenance procedures.
+
+ @param src file name
+ @param replication new replication
+
+ @return true if successful;
+ false if file does not exist or is a directory
+
+ @throws AccessControlException If access is denied
+ @throws DSQuotaExceededException If replication violates disk space
+ quota restriction
+ @throws FileNotFoundException If file src is not found
+ @throws SafeModeException not allowed in safemode
+ @throws UnresolvedLinkException if src contains a symlink
+ @throws IOException If an I/O error occurred]]>
+
+
+
+
+
+
+
+
+
+
+
+ src is not found
+ @throws SafeModeException not allowed in safemode
+ @throws UnresolvedLinkException If src contains a symlink
+ @throws IOException If an I/O error occurred]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ src is not found
+ @throws SafeModeException not allowed in safemode
+ @throws UnresolvedLinkException If src contains a symlink
+ @throws IOException If an I/O error occurred]]>
+
+
+
+
+
+
+
+
+
+
+
+ src is not found
+ @throws UnresolvedLinkException If src contains a symlink
+ @throws IOException If an I/O error occurred]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ src is not found
+ @throws NotReplicatedYetException previous blocks of the file are not
+ replicated yet. Blocks cannot be added until replication
+ completes.
+ @throws SafeModeException create not allowed in safemode
+ @throws UnresolvedLinkException If src contains a symlink
+ @throws IOException If an I/O error occurred]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ src is not found
+ @throws SafeModeException create not allowed in safemode
+ @throws UnresolvedLinkException If src contains a symlink
+ @throws IOException If an I/O error occurred]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ trg or srcs
+ contains a symlink]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Fails if src is a file and dst is a directory.
+
Fails if src is a directory and dst is a file.
+
Fails if the parent of dst does not exist or is a file.
+
+
+ Without OVERWRITE option, rename fails if the dst already exists.
+ With OVERWRITE option, rename overwrites the dst, if it is a file
+ or an empty directory. Rename fails if dst is a non-empty directory.
+
+ This implementation of rename is atomic.
+
+ @param src existing file or directory name.
+ @param dst new name.
+ @param options Rename options
+
+ @throws AccessControlException If access is denied
+ @throws DSQuotaExceededException If rename violates disk space
+ quota restriction
+ @throws FileAlreadyExistsException If dst already exists and
+ options has {@link Rename#OVERWRITE} option
+ false.
+ @throws FileNotFoundException If src does not exist
+ @throws NSQuotaExceededException If rename violates namespace
+ quota restriction
+ @throws ParentNotDirectoryException If parent of dst
+ is not a directory
+ @throws SafeModeException rename not allowed in safemode
+ @throws UnresolvedLinkException If src or
+ dst contains a symlink
+ @throws IOException If an I/O error occurred]]>
+
+
+
+
+
+
+
+
+ Any blocks belonging to the deleted files will be garbage-collected.
+
+ @param src existing name.
+ @return true only if the existing file or directory was actually removed
+ from the file system.
+ @throws UnresolvedLinkException if src contains a symlink.
+ @deprecated use {@link #delete(String, boolean)} istead.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+ same as delete but provides a way to avoid accidentally
+ deleting non empty directories programmatically.
+ @param src existing name
+ @param recursive if true deletes a non empty directory recursively,
+ else throws an exception.
+ @return true only if the existing file or directory was actually removed
+ from the file system.
+
+ @throws AccessControlException If access is denied
+ @throws FileNotFoundException If file src is not found
+ @throws SafeModeException create not allowed in safemode
+ @throws UnresolvedLinkException If src contains a symlink
+ @throws IOException If an I/O error occurred]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ src already exists
+ @throws FileNotFoundException If parent of src does not exist
+ and createParent is false
+ @throws NSQuotaExceededException If file creation violates quota restriction
+ @throws ParentNotDirectoryException If parent of src
+ is not a directory
+ @throws SafeModeException create not allowed in safemode
+ @throws UnresolvedLinkException If src contains a symlink
+ @throws IOException If an I/O error occurred.
+
+ RunTimeExceptions:
+ @throws InvalidPathException If src is invalid]]>
+
+
+
+
+
+
+
+
+
+
+
+ src is not found
+ @throws UnresolvedLinkException If src contains a symlink
+ @throws IOException If an I/O error occurred]]>
+
+
+
+
+
+
+
+
+ So, the NameNode will revoke the locks and live file-creates
+ for clients that it thinks have died. A client tells the
+ NameNode that it is still alive by periodically calling
+ renewLease(). If a certain amount of time passes since
+ the last call to renewLease(), the NameNode assumes the
+ client has died.
+
+ @throws AccessControlException permission denied
+ @throws IOException If an I/O error occurred]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
[0] contains the total storage capacity of the system, in bytes.
+
[1] contains the total used space of the system, in bytes.
+
[2] contains the available storage of the system, in bytes.
+
[3] contains number of under replicated blocks in the system.
+
[4] contains number of blocks with a corrupt replica.
+
[5] contains number of blocks without any good replicas left.
+
+ Use public constants like {@link #GET_STATS_CAPACITY_IDX} in place of
+ actual numbers to index into the array.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Safe mode is a name node state when it
+
does not accept changes to name space (read-only), and
+
does not replicate or delete blocks.
+
+
+ Safe mode is entered automatically at name node startup.
+ Safe mode can also be entered manually using
+ {@link #setSafeMode(FSConstants.SafeModeAction) setSafeMode(SafeModeAction.SAFEMODE_GET)}.
+
+ At startup the name node accepts data node reports collecting
+ information about block locations.
+ In order to leave safe mode it needs to collect a configurable
+ percentage called threshold of blocks, which satisfy the minimal
+ replication condition.
+ The minimal replication condition is that each block must have at least
+ dfs.namenode.replication.min replicas.
+ When the threshold is reached the name node extends safe mode
+ for a configurable amount of time
+ to let the remaining data nodes to check in before it
+ will start replicating missing blocks.
+ Then the name node leaves safe mode.
+
+ If safe mode is turned on manually using
+ {@link #setSafeMode(FSConstants.SafeModeAction) setSafeMode(SafeModeAction.SAFEMODE_ENTER)}
+ then the name node stays in safe mode until it is manually turned off
+ using {@link #setSafeMode(FSConstants.SafeModeAction) setSafeMode(SafeModeAction.SAFEMODE_LEAVE)}.
+ Current state of the name node can be verified using
+ {@link #setSafeMode(FSConstants.SafeModeAction) setSafeMode(SafeModeAction.SAFEMODE_GET)}
+
Configuration parameters:
+ dfs.safemode.threshold.pct is the threshold parameter.
+ dfs.safemode.extension is the safe mode extension parameter.
+ dfs.namenode.replication.min is the minimal replication parameter.
+
+
Special cases:
+ The name node does not enter safe mode at startup if the threshold is
+ set to 0 or if the name space is empty.
+ If the threshold is set to 1 then all blocks need to have at least
+ minimal replication.
+ If the threshold value is greater than 1 then the name node will not be
+ able to turn off safe mode automatically.
+ Safe mode can always be turned off manually.
+
+ @param action
The layout of how namenode or datanode stores information
+ on disk changes.
+
A new operation code is added to the editlog.
+
Modification such as format of a record, content of a record
+ in editlog or fsimage.
+
+
+ How to update layout version:
+ When a change requires new layout version, please add an entry into
+ {@link Feature} with a short enum name, new layout version and description
+ of the change. Please see {@link Feature} for further details.
+ ]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ To add a new layout version:
+
+
Define a new enum constant with a short enum name, the new layout version
+ and description of the added feature.
+
When adding a layout version with an ancestor that is not same as
+ its immediate predecessor, use the constructor where a spacific ancestor
+ can be passed.
+
+ To start:
+ bin/start-balancer.sh [-threshold ]
+ Example: bin/ start-balancer.sh
+ start the balancer with a default threshold of 10%
+ bin/ start-balancer.sh -threshold 5
+ start the balancer with a threshold of 5%
+ To stop:
+ bin/ stop-balancer.sh
+
+
+
DESCRIPTION
+
The threshold parameter is a fraction in the range of (0%, 100%) with a
+ default value of 10%. The threshold sets a target for whether the cluster
+ is balanced. A cluster is balanced if for each datanode, the utilization
+ of the node (ratio of used space at the node to total capacity of the node)
+ differs from the utilization of the (ratio of used space in the cluster
+ to total capacity of the cluster) by no more than the threshold value.
+ The smaller the threshold, the more balanced a cluster will become.
+ It takes more time to run the balancer for small threshold values.
+ Also for a very small threshold the cluster may not be able to reach the
+ balanced state when applications write and delete files concurrently.
+
+
The tool moves blocks from highly utilized datanodes to poorly
+ utilized datanodes iteratively. In each iteration a datanode moves or
+ receives no more than the lesser of 10G bytes or the threshold fraction
+ of its capacity. Each iteration runs no more than 20 minutes.
+ At the end of each iteration, the balancer obtains updated datanodes
+ information from the namenode.
+
+
A system property that limits the balancer's use of bandwidth is
+ defined in the default configuration file:
+
+
+ dfs.balance.bandwidthPerSec
+ 1048576
+ Specifies the maximum bandwidth that each datanode
+ can utilize for the balancing purpose in term of the number of bytes
+ per second.
+
+
+
+
This property determines the maximum speed at which a block will be
+ moved from one datanode to another. The default value is 1MB/s. The higher
+ the bandwidth, the faster a cluster can reach the balanced state,
+ but with greater competition with application processes. If an
+ administrator changes the value of this property in the configuration
+ file, the change is observed when HDFS is next restarted.
+
+
MONITERING BALANCER PROGRESS
+
After the balancer is started, an output file name where the balancer
+ progress will be recorded is printed on the screen. The administrator
+ can monitor the running of the balancer by reading the output file.
+ The output shows the balancer's status iteration by iteration. In each
+ iteration it prints the starting time, the iteration number, the total
+ number of bytes that have been moved in the previous iterations,
+ the total number of bytes that are left to move in order for the cluster
+ to be balanced, and the number of bytes that are being moved in this
+ iteration. Normally "Bytes Already Moved" is increasing while "Bytes Left
+ To Move" is decreasing.
+
+
Running multiple instances of the balancer in an HDFS cluster is
+ prohibited by the tool.
+
+
The balancer automatically exits when any of the following five
+ conditions is satisfied:
+
+
The cluster is balanced;
+
No block can be moved;
+
No block has been moved for five consecutive iterations;
+
An IOException occurs while communicating with the namenode;
+
Another balancer is running.
+
+
+
Upon exit, a balancer returns an exit code and prints one of the
+ following messages to the output file in corresponding to the above exit
+ reasons:
+
+
The cluster is balanced. Exiting
+
No block can be moved. Exiting...
+
No block has been moved for 3 iterations. Exiting...
+
Received an IO exception: failure reason. Exiting...
+
Another balancer is running. Exiting...
+
+
+
The administrator can interrupt the execution of the balancer at any
+ time by running the command "stop-balancer.sh" on the machine where the
+ balancer is running.]]>
+
+ Local storage can reside in multiple directories.
+ Each directory should contain the same VERSION file as the others.
+ During startup Hadoop servers (name-node and data-nodes) read their local
+ storage information from them.
+
+ The servers hold a lock for each storage directory while they run so that
+ other nodes were not able to startup sharing the same storage.
+ The locks are released when the servers stop (normally or abnormally).]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Removes contents of the current directory and creates an empty directory.
+
+ This does not fully format storage directory.
+ It cannot write the version file since it should be written last after
+ all other storage type dependent files are written.
+ Derived storage is responsible for setting specific storage values and
+ writing the version file to disk.
+
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
node type
+
layout version
+
namespaceID
+
fs state creation time
+
other fields specific for this node type
+
+ The version file is always written last during storage directory updates.
+ The existence of the version file indicates that all other files have
+ been successfully written in the storage directory, the storage is valid
+ and does not need to be recovered.
+
+ @return the version file path]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Locking is not supported by all file systems.
+ E.g., NFS does not consistently support exclusive locks.
+
+
If locking is supported we guarantee exculsive access to the
+ storage directory. Otherwise, no guarantee is given.
+
+ @throws IOException if locking fails]]>
+
+ For the metrics that are sampled and averaged, one must specify
+ a metrics context that does periodic update calls. Most metrics contexts do.
+ The default Null metrics context however does NOT. So if you aren't
+ using any other metrics context then you can turn on the viewing and averaging
+ of sampled metrics by specifying the following two lines
+ in the hadoop-meterics.properties file:
+
+ Note that the metrics are collected regardless of the context used.
+ The context with the update thread is used to average the data periodically
+
+
+
+ Impl details: We use a dynamic mbean that gets the list of the metrics
+ from the metrics registry passed as an argument to the constructor]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class has a number of metrics variables that are publicly accessible;
+ these variables (objects) have methods to update their values;
+ for example:
+
{@link NamenodeRole#CHECKPOINT} node periodically creates checkpoints,
+ that is downloads image and edits from the active node, merges them, and
+ uploads the new image back to the active.
+
{@link NamenodeRole#BACKUP} node keeps its namespace in sync with the
+ active node, and periodically creates checkpoints by simply saving the
+ namespace image to local disk(s).
+ Finally, the namenode returns its namespaceID as the registrationID
+ for the datanodes.
+ namespaceID is a persistent attribute of the name space.
+ The registrationID is checked every time the datanode is communicating
+ with the namenode.
+ Datanodes with inappropriate registrationID are rejected.
+ If the namenode stops, and then restarts it can restore its
+ namespaceID and will continue serving the datanodes that has previously
+ registered with the namenode without restarting the whole cluster.
+
+ @see org.apache.hadoop.hdfs.server.datanode.DataNode#register()]]>
+
+ The option is passed via configuration field:
+ dfs.namenode.startup
+
+ The conf will be modified to reflect the actual ports on which
+ the NameNode is up and running if the user passes the port as
+ zero in the conf.
+
+ @param conf confirguration
+ @throws IOException]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ blocksequence (namespace)
+ 2) block->machinelist ("inodes")
+
+ The first table is stored on disk and is very precious.
+ The second table is rebuilt every time the NameNode comes
+ up.
+
+ 'NameNode' refers to both this class as well as the 'NameNode server'.
+ The 'FSNamesystem' class actually performs most of the filesystem
+ management. The majority of the 'NameNode' class itself is concerned
+ with exposing the IPC interface and the http server to the outside world,
+ plus some configuration management.
+
+ NameNode implements the ClientProtocol interface, which allows
+ clients to ask for DFS services. ClientProtocol is not
+ designed for direct use by authors of DFS client code. End-users
+ should instead use the org.apache.nutch.hadoop.fs.FileSystem class.
+
+ NameNode also implements the DatanodeProtocol interface, used by
+ DataNode programs that actually store DFS data blocks. These
+ methods are invoked repeatedly and automatically by all the
+ DataNodes in a DFS deployment.
+
+ NameNode also implements the NamenodeProtocol interface, used by
+ secondary namenodes or rebalancing processes to get partial namenode's
+ state, for example partial blocksMap etc.]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The tool scans all files and directories, starting from an indicated
+ root path. The following abnormal conditions are detected and handled:
+
+
files with blocks that are completely missing from all datanodes.
+ In this case the tool can perform one of the following actions:
+
+
none ({@link #FIXING_NONE})
+
move corrupted files to /lost+found directory on DFS
+ ({@link #FIXING_MOVE}). Remaining data blocks are saved as a
+ block chains, representing longest consecutive series of valid blocks.
+
delete corrupted files ({@link #FIXING_DELETE})
+
+
+
detect files with under-replicated or over-replicated blocks
+
+
+
+
+
+
+
+
+
+
+ For the metrics that are sampled and averaged, one must specify
+ a metrics context that does periodic update calls. Most metrics contexts do.
+ The default Null metrics context however does NOT. So if you aren't
+ using any other metrics context then you can turn on the viewing and averaging
+ of sampled metrics by specifying the following two lines
+ in the hadoop-meterics.properties file:
+
+ Note that the metrics are collected regardless of the context used.
+ The context with the update thread is used to average the data periodically
+
+
+
+ Impl details: We use a dynamic mbean that gets the list of the metrics
+ from the metrics registry passed as an argument to the constructor]]>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ This class has a number of metrics variables that are publicly accessible;
+ these variables (objects) have methods to update their values;
+ for example:
+
move corrupted files to /lost+found directory on DFS
+ ({@link org.apache.hadoop.hdfs.server.namenode.NamenodeFsck#FIXING_MOVE}). Remaining data blocks are saved as a
+ block chains, representing longest consecutive series of valid blocks.
+
+
+
diff --git a/x86_64/share/hadoop/hdfs/webapps/static/hadoop.css b/x86_64/share/hadoop/hdfs/webapps/static/hadoop.css
new file mode 100644
index 0000000..9b031c7
--- /dev/null
+++ b/x86_64/share/hadoop/hdfs/webapps/static/hadoop.css
@@ -0,0 +1,190 @@
+/*
+* Licensed to the Apache Software Foundation (ASF) under one or more
+* contributor license agreements. See the NOTICE file distributed with
+* this work for additional information regarding copyright ownership.
+* The ASF licenses this file to You under the Apache License, Version 2.0
+* (the "License"); you may not use this file except in compliance with
+* the License. You may obtain a copy of the License at
+*
+* http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+body {
+ background-color : #ffffff;
+ font-family : sans-serif;
+}
+
+.small {
+ font-size : smaller;
+}
+
+div#dfsnodetable tr#row1, div.dfstable td.col1 {
+ font-weight : bolder;
+}
+
+div.dfstable th {
+ text-align:left;
+ vertical-align : top;
+}
+
+div.dfstable td#col3 {
+ text-align : right;
+}
+
+div#dfsnodetable caption {
+ text-align : left;
+}
+
+div#dfsnodetable a#title {
+ font-size : larger;
+ font-weight : bolder;
+}
+
+div#dfsnodetable td, th {
+ padding-bottom : 4px;
+ padding-top : 4px;
+}
+
+div#dfsnodetable A:link, A:visited {
+ text-decoration : none;
+}
+
+div#dfsnodetable th.header, th.headerASC, th.headerDSC {
+ padding-bottom : 8px;
+ padding-top : 8px;
+}
+div#dfsnodetable th.header:hover, th.headerASC:hover, th.headerDSC:hover,
+ td.name:hover {
+ text-decoration : underline;
+ cursor : pointer;
+}
+
+div#dfsnodetable td.blocks, td.size, td.pcused, td.adminstate, td.lastcontact {
+ text-align : right;
+}
+
+div#dfsnodetable .rowNormal .header {
+ background-color : #ffffff;
+}
+div#dfsnodetable .rowAlt, .headerASC, .headerDSC {
+ background-color : lightyellow;
+}
+
+.warning {
+ font-weight : bolder;
+ color : red;
+}
+
+div.dfstable table {
+ white-space : pre;
+}
+
+table.storage, table.nodes {
+ border-collapse: collapse;
+}
+
+table.storage td {
+ padding:10px;
+ border:1px solid black;
+}
+
+table.nodes td {
+ padding:0px;
+ border:1px solid black;
+}
+
+div#dfsnodetable td, div#dfsnodetable th, div.dfstable td {
+ padding-left : 10px;
+ padding-right : 10px;
+ border:1px solid black;
+}
+
+td.perc_filled {
+ background-color:#AAAAFF;
+}
+
+td.perc_nonfilled {
+ background-color:#FFFFFF;
+}
+
+line.taskgraphline {
+ stroke-width:1;stroke-linecap:round;
+}
+
+#quicklinks {
+ margin: 0;
+ padding: 2px 4px;
+ position: fixed;
+ top: 0;
+ right: 0;
+ text-align: right;
+ background-color: #eee;
+ font-weight: bold;
+}
+
+#quicklinks ul {
+ margin: 0;
+ padding: 0;
+ list-style-type: none;
+ font-weight: normal;
+}
+
+#quicklinks ul {
+ display: none;
+}
+
+#quicklinks a {
+ font-size: smaller;
+ text-decoration: none;
+}
+
+#quicklinks ul a {
+ text-decoration: underline;
+}
+
+span.failed {
+ color:red;
+}
+
+div.security {
+ width:100%;
+}
+
+#startupprogress table, #startupprogress th, #startupprogress td {
+ border-collapse: collapse;
+ border-left: 1px solid black;
+ border-right: 1px solid black;
+ padding: 5px;
+ text-align: left;
+}
+
+#startupprogress table {
+ border: 1px solid black;
+}
+
+.phase {
+ border-top: 1px solid black;
+ font-weight: bold;
+}
+
+.current {
+ font-style: italic;
+}
+
+.later {
+ color: gray;
+}
+
+.step .startupdesc {
+ text-indent: 20px;
+}
+
+#startupprogress span {
+ font-weight: bold;
+}
diff --git a/x86_64/share/hadoop/httpfs/tomcat/LICENSE b/x86_64/share/hadoop/httpfs/tomcat/LICENSE
new file mode 100644
index 0000000..51c34b6
--- /dev/null
+++ b/x86_64/share/hadoop/httpfs/tomcat/LICENSE
@@ -0,0 +1,707 @@
+
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ APPENDIX: How to apply the Apache License to your work.
+
+ To apply the Apache License to your work, attach the following
+ boilerplate notice, with the fields enclosed by brackets "[]"
+ replaced with your own identifying information. (Don't include
+ the brackets!) The text should be enclosed in the appropriate
+ comment syntax for the file format. We also recommend that a
+ file or class name and description of purpose be included on the
+ same "printed page" as the copyright notice for easier
+ identification within third-party archives.
+
+ Copyright [yyyy] [name of copyright owner]
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+
+
+APACHE TOMCAT SUBCOMPONENTS:
+
+Apache Tomcat includes a number of subcomponents with separate copyright notices
+and license terms. Your use of these subcomponents is subject to the terms and
+conditions of the following licenses.
+
+
+For the Eclipse JDT Java compiler:
+
+Eclipse Public License - v 1.0
+
+THE ACCOMPANYING PROGRAM IS PROVIDED UNDER THE TERMS OF THIS ECLIPSE PUBLIC
+LICENSE ("AGREEMENT"). ANY USE, REPRODUCTION OR DISTRIBUTION OF THE PROGRAM
+CONSTITUTES RECIPIENT'S ACCEPTANCE OF THIS AGREEMENT.
+
+1. DEFINITIONS
+
+"Contribution" means:
+
+a) in the case of the initial Contributor, the initial code and documentation
+distributed under this Agreement, and
+
+b) in the case of each subsequent Contributor:
+
+i) changes to the Program, and
+
+ii) additions to the Program;
+
+where such changes and/or additions to the Program originate from and are
+distributed by that particular Contributor. A Contribution 'originates' from a
+Contributor if it was added to the Program by such Contributor itself or anyone
+acting on such Contributor's behalf. Contributions do not include additions to
+the Program which: (i) are separate modules of software distributed in
+conjunction with the Program under their own license agreement, and (ii) are not
+derivative works of the Program.
+
+"Contributor" means any person or entity that distributes the Program.
+
+"Licensed Patents" mean patent claims licensable by a Contributor which are
+necessarily infringed by the use or sale of its Contribution alone or when
+combined with the Program.
+
+"Program" means the Contributions distributed in accordance with this Agreement.
+
+"Recipient" means anyone who receives the Program under this Agreement,
+including all Contributors.
+
+2. GRANT OF RIGHTS
+
+a) Subject to the terms of this Agreement, each Contributor hereby grants
+Recipient a non-exclusive, worldwide, royalty-free copyright license to
+reproduce, prepare derivative works of, publicly display, publicly perform,
+distribute and sublicense the Contribution of such Contributor, if any, and such
+derivative works, in source code and object code form.
+
+b) Subject to the terms of this Agreement, each Contributor hereby grants
+Recipient a non-exclusive, worldwide, royalty-free patent license under Licensed
+Patents to make, use, sell, offer to sell, import and otherwise transfer the
+Contribution of such Contributor, if any, in source code and object code form.
+This patent license shall apply to the combination of the Contribution and the
+Program if, at the time the Contribution is added by the Contributor, such
+addition of the Contribution causes such combination to be covered by the
+Licensed Patents. The patent license shall not apply to any other combinations
+which include the Contribution. No hardware per se is licensed hereunder.
+
+c) Recipient understands that although each Contributor grants the licenses to
+its Contributions set forth herein, no assurances are provided by any
+Contributor that the Program does not infringe the patent or other intellectual
+property rights of any other entity. Each Contributor disclaims any liability to
+Recipient for claims brought by any other entity based on infringement of
+intellectual property rights or otherwise. As a condition to exercising the
+rights and licenses granted hereunder, each Recipient hereby assumes sole
+responsibility to secure any other intellectual property rights needed, if any.
+For example, if a third party patent license is required to allow Recipient to
+distribute the Program, it is Recipient's responsibility to acquire that license
+before distributing the Program.
+
+d) Each Contributor represents that to its knowledge it has sufficient copyright
+rights in its Contribution, if any, to grant the copyright license set forth in
+this Agreement.
+
+3. REQUIREMENTS
+
+A Contributor may choose to distribute the Program in object code form under its
+own license agreement, provided that:
+
+a) it complies with the terms and conditions of this Agreement; and
+
+b) its license agreement:
+
+i) effectively disclaims on behalf of all Contributors all warranties and
+conditions, express and implied, including warranties or conditions of title and
+non-infringement, and implied warranties or conditions of merchantability and
+fitness for a particular purpose;
+
+ii) effectively excludes on behalf of all Contributors all liability for
+damages, including direct, indirect, special, incidental and consequential
+damages, such as lost profits;
+
+iii) states that any provisions which differ from this Agreement are offered by
+that Contributor alone and not by any other party; and
+
+iv) states that source code for the Program is available from such Contributor,
+and informs licensees how to obtain it in a reasonable manner on or through a
+medium customarily used for software exchange.
+
+When the Program is made available in source code form:
+
+a) it must be made available under this Agreement; and
+
+b) a copy of this Agreement must be included with each copy of the Program.
+
+Contributors may not remove or alter any copyright notices contained within the
+Program.
+
+Each Contributor must identify itself as the originator of its Contribution, if
+any, in a manner that reasonably allows subsequent Recipients to identify the
+originator of the Contribution.
+
+4. COMMERCIAL DISTRIBUTION
+
+Commercial distributors of software may accept certain responsibilities with
+respect to end users, business partners and the like. While this license is
+intended to facilitate the commercial use of the Program, the Contributor who
+includes the Program in a commercial product offering should do so in a manner
+which does not create potential liability for other Contributors. Therefore, if
+a Contributor includes the Program in a commercial product offering, such
+Contributor ("Commercial Contributor") hereby agrees to defend and indemnify
+every other Contributor ("Indemnified Contributor") against any losses, damages
+and costs (collectively "Losses") arising from claims, lawsuits and other legal
+actions brought by a third party against the Indemnified Contributor to the
+extent caused by the acts or omissions of such Commercial Contributor in
+connection with its distribution of the Program in a commercial product
+offering. The obligations in this section do not apply to any claims or Losses
+relating to any actual or alleged intellectual property infringement. In order
+to qualify, an Indemnified Contributor must: a) promptly notify the Commercial
+Contributor in writing of such claim, and b) allow the Commercial Contributor
+to control, and cooperate with the Commercial Contributor in, the defense and
+any related settlement negotiations. The Indemnified Contributor may
+participate in any such claim at its own expense.
+
+For example, a Contributor might include the Program in a commercial product
+offering, Product X. That Contributor is then a Commercial Contributor. If that
+Commercial Contributor then makes performance claims, or offers warranties
+related to Product X, those performance claims and warranties are such
+Commercial Contributor's responsibility alone. Under this section, the
+Commercial Contributor would have to defend claims against the other
+Contributors related to those performance claims and warranties, and if a court
+requires any other Contributor to pay any damages as a result, the Commercial
+Contributor must pay those damages.
+
+5. NO WARRANTY
+
+EXCEPT AS EXPRESSLY SET FORTH IN THIS AGREEMENT, THE PROGRAM IS PROVIDED ON AN
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, EITHER EXPRESS OR
+IMPLIED INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES OR CONDITIONS OF TITLE,
+NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Each
+Recipient is solely responsible for determining the appropriateness of using and
+distributing the Program and assumes all risks associated with its exercise of
+rights under this Agreement , including but not limited to the risks and costs
+of program errors, compliance with applicable laws, damage to or loss of data,
+programs or equipment, and unavailability or interruption of operations.
+
+6. DISCLAIMER OF LIABILITY
+
+EXCEPT AS EXPRESSLY SET FORTH IN THIS AGREEMENT, NEITHER RECIPIENT NOR ANY
+CONTRIBUTORS SHALL HAVE ANY LIABILITY FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING WITHOUT LIMITATION LOST
+PROFITS), HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+OUT OF THE USE OR DISTRIBUTION OF THE PROGRAM OR THE EXERCISE OF ANY RIGHTS
+GRANTED HEREUNDER, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
+
+7. GENERAL
+
+If any provision of this Agreement is invalid or unenforceable under applicable
+law, it shall not affect the validity or enforceability of the remainder of the
+terms of this Agreement, and without further action by the parties hereto, such
+provision shall be reformed to the minimum extent necessary to make such
+provision valid and enforceable.
+
+If Recipient institutes patent litigation against any entity (including a
+cross-claim or counterclaim in a lawsuit) alleging that the Program itself
+(excluding combinations of the Program with other software or hardware)
+infringes such Recipient's patent(s), then such Recipient's rights granted under
+Section 2(b) shall terminate as of the date such litigation is filed.
+
+All Recipient's rights under this Agreement shall terminate if it fails to
+comply with any of the material terms or conditions of this Agreement and does
+not cure such failure in a reasonable period of time after becoming aware of
+such noncompliance. If all Recipient's rights under this Agreement terminate,
+Recipient agrees to cease use and distribution of the Program as soon as
+reasonably practicable. However, Recipient's obligations under this Agreement
+and any licenses granted by Recipient relating to the Program shall continue and
+survive.
+
+Everyone is permitted to copy and distribute copies of this Agreement, but in
+order to avoid inconsistency the Agreement is copyrighted and may only be
+modified in the following manner. The Agreement Steward reserves the right to
+publish new versions (including revisions) of this Agreement from time to time.
+No one other than the Agreement Steward has the right to modify this Agreement.
+The Eclipse Foundation is the initial Agreement Steward. The Eclipse Foundation
+may assign the responsibility to serve as the Agreement Steward to a suitable
+separate entity. Each new version of the Agreement will be given a
+distinguishing version number. The Program (including Contributions) may always
+be distributed subject to the version of the Agreement under which it was
+received. In addition, after a new version of the Agreement is published,
+Contributor may elect to distribute the Program (including its Contributions)
+under the new version. Except as expressly stated in Sections 2(a) and 2(b)
+above, Recipient receives no rights or licenses to the intellectual property of
+any Contributor under this Agreement, whether expressly, by implication,
+estoppel or otherwise. All rights in the Program not expressly granted under
+this Agreement are reserved.
+
+This Agreement is governed by the laws of the State of New York and the
+intellectual property laws of the United States of America. No party to this
+Agreement will bring a legal action under this Agreement more than one year
+after the cause of action arose. Each party waives its rights to a jury trial in
+any resulting litigation.
+
+
+For the Windows Installer component:
+
+ * All NSIS source code, plug-ins, documentation, examples, header files and
+ graphics, with the exception of the compression modules and where
+ otherwise noted, are licensed under the zlib/libpng license.
+ * The zlib compression module for NSIS is licensed under the zlib/libpng
+ license.
+ * The bzip2 compression module for NSIS is licensed under the bzip2 license.
+ * The lzma compression module for NSIS is licensed under the Common Public
+ License version 1.0.
+
+zlib/libpng license
+
+This software is provided 'as-is', without any express or implied warranty. In
+no event will the authors be held liable for any damages arising from the use of
+this software.
+
+Permission is granted to anyone to use this software for any purpose, including
+commercial applications, and to alter it and redistribute it freely, subject to
+the following restrictions:
+
+ 1. The origin of this software must not be misrepresented; you must not claim
+ that you wrote the original software. If you use this software in a
+ product, an acknowledgment in the product documentation would be
+ appreciated but is not required.
+ 2. Altered source versions must be plainly marked as such, and must not be
+ misrepresented as being the original software.
+ 3. This notice may not be removed or altered from any source distribution.
+
+bzip2 license
+
+Redistribution and use in source and binary forms, with or without modification,
+are permitted provided that the following conditions are met:
+
+ 1. Redistributions of source code must retain the above copyright notice,
+ this list of conditions and the following disclaimer.
+ 2. The origin of this software must not be misrepresented; you must not claim
+ that you wrote the original software. If you use this software in a
+ product, an acknowledgment in the product documentation would be
+ appreciated but is not required.
+ 3. Altered source versions must be plainly marked as such, and must not be
+ misrepresented as being the original software.
+ 4. The name of the author may not be used to endorse or promote products
+ derived from this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS AND ANY EXPRESS OR IMPLIED
+WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT
+SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT
+OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
+IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY
+OF SUCH DAMAGE.
+
+Julian Seward, Cambridge, UK.
+
+jseward@acm.org
+Common Public License version 1.0
+
+THE ACCOMPANYING PROGRAM IS PROVIDED UNDER THE TERMS OF THIS COMMON PUBLIC
+LICENSE ("AGREEMENT"). ANY USE, REPRODUCTION OR DISTRIBUTION OF THE PROGRAM
+CONSTITUTES RECIPIENT'S ACCEPTANCE OF THIS AGREEMENT.
+
+1. DEFINITIONS
+
+"Contribution" means:
+
+a) in the case of the initial Contributor, the initial code and documentation
+distributed under this Agreement, and b) in the case of each subsequent
+Contributor:
+
+i) changes to the Program, and
+
+ii) additions to the Program;
+
+where such changes and/or additions to the Program originate from and are
+distributed by that particular Contributor. A Contribution 'originates' from a
+Contributor if it was added to the Program by such Contributor itself or anyone
+acting on such Contributor's behalf. Contributions do not include additions to
+the Program which: (i) are separate modules of software distributed in
+conjunction with the Program under their own license agreement, and (ii) are not
+derivative works of the Program.
+
+"Contributor" means any person or entity that distributes the Program.
+
+"Licensed Patents " mean patent claims licensable by a Contributor which are
+necessarily infringed by the use or sale of its Contribution alone or when
+combined with the Program.
+
+"Program" means the Contributions distributed in accordance with this Agreement.
+
+"Recipient" means anyone who receives the Program under this Agreement,
+including all Contributors.
+
+2. GRANT OF RIGHTS
+
+a) Subject to the terms of this Agreement, each Contributor hereby grants
+Recipient a non-exclusive, worldwide, royalty-free copyright license to
+reproduce, prepare derivative works of, publicly display, publicly perform,
+distribute and sublicense the Contribution of such Contributor, if any, and such
+derivative works, in source code and object code form.
+
+b) Subject to the terms of this Agreement, each Contributor hereby grants
+Recipient a non-exclusive, worldwide, royalty-free patent license under Licensed
+Patents to make, use, sell, offer to sell, import and otherwise transfer the
+Contribution of such Contributor, if any, in source code and object code form.
+This patent license shall apply to the combination of the Contribution and the
+Program if, at the time the Contribution is added by the Contributor, such
+addition of the Contribution causes such combination to be covered by the
+Licensed Patents. The patent license shall not apply to any other combinations
+which include the Contribution. No hardware per se is licensed hereunder.
+
+c) Recipient understands that although each Contributor grants the licenses to
+its Contributions set forth herein, no assurances are provided by any
+Contributor that the Program does not infringe the patent or other intellectual
+property rights of any other entity. Each Contributor disclaims any liability to
+Recipient for claims brought by any other entity based on infringement of
+intellectual property rights or otherwise. As a condition to exercising the
+rights and licenses granted hereunder, each Recipient hereby assumes sole
+responsibility to secure any other intellectual property rights needed, if any.
+For example, if a third party patent license is required to allow Recipient to
+distribute the Program, it is Recipient's responsibility to acquire that license
+before distributing the Program.
+
+d) Each Contributor represents that to its knowledge it has sufficient copyright
+rights in its Contribution, if any, to grant the copyright license set forth in
+this Agreement.
+
+3. REQUIREMENTS
+
+A Contributor may choose to distribute the Program in object code form under its
+own license agreement, provided that:
+
+a) it complies with the terms and conditions of this Agreement; and
+
+b) its license agreement:
+
+i) effectively disclaims on behalf of all Contributors all warranties and
+conditions, express and implied, including warranties or conditions of title and
+non-infringement, and implied warranties or conditions of merchantability and
+fitness for a particular purpose;
+
+ii) effectively excludes on behalf of all Contributors all liability for
+damages, including direct, indirect, special, incidental and consequential
+damages, such as lost profits;
+
+iii) states that any provisions which differ from this Agreement are offered by
+that Contributor alone and not by any other party; and
+
+iv) states that source code for the Program is available from such Contributor,
+and informs licensees how to obtain it in a reasonable manner on or through a
+medium customarily used for software exchange.
+
+When the Program is made available in source code form:
+
+a) it must be made available under this Agreement; and
+
+b) a copy of this Agreement must be included with each copy of the Program.
+
+Contributors may not remove or alter any copyright notices contained within the
+Program.
+
+Each Contributor must identify itself as the originator of its Contribution, if
+any, in a manner that reasonably allows subsequent Recipients to identify the
+originator of the Contribution.
+
+4. COMMERCIAL DISTRIBUTION
+
+Commercial distributors of software may accept certain responsibilities with
+respect to end users, business partners and the like. While this license is
+intended to facilitate the commercial use of the Program, the Contributor who
+includes the Program in a commercial product offering should do so in a manner
+which does not create potential liability for other Contributors. Therefore, if
+a Contributor includes the Program in a commercial product offering, such
+Contributor ("Commercial Contributor") hereby agrees to defend and indemnify
+every other Contributor ("Indemnified Contributor") against any losses, damages
+and costs (collectively "Losses") arising from claims, lawsuits and other legal
+actions brought by a third party against the Indemnified Contributor to the
+extent caused by the acts or omissions of such Commercial Contributor in
+connection with its distribution of the Program in a commercial product
+offering. The obligations in this section do not apply to any claims or Losses
+relating to any actual or alleged intellectual property infringement. In order
+to qualify, an Indemnified Contributor must: a) promptly notify the Commercial
+Contributor in writing of such claim, and b) allow the Commercial Contributor to
+control, and cooperate with the Commercial Contributor in, the defense and any
+related settlement negotiations. The Indemnified Contributor may participate in
+any such claim at its own expense.
+
+For example, a Contributor might include the Program in a commercial product
+offering, Product X. That Contributor is then a Commercial Contributor. If that
+Commercial Contributor then makes performance claims, or offers warranties
+related to Product X, those performance claims and warranties are such
+Commercial Contributor's responsibility alone. Under this section, the
+Commercial Contributor would have to defend claims against the other
+Contributors related to those performance claims and warranties, and if a court
+requires any other Contributor to pay any damages as a result, the Commercial
+Contributor must pay those damages.
+
+5. NO WARRANTY
+
+EXCEPT AS EXPRESSLY SET FORTH IN THIS AGREEMENT, THE PROGRAM IS PROVIDED ON AN
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, EITHER EXPRESS OR
+IMPLIED INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES OR CONDITIONS OF TITLE,
+NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Each
+Recipient is solely responsible for determining the appropriateness of using and
+distributing the Program and assumes all risks associated with its exercise of
+rights under this Agreement, including but not limited to the risks and costs of
+program errors, compliance with applicable laws, damage to or loss of data,
+programs or equipment, and unavailability or interruption of operations.
+
+6. DISCLAIMER OF LIABILITY
+
+EXCEPT AS EXPRESSLY SET FORTH IN THIS AGREEMENT, NEITHER RECIPIENT NOR ANY
+CONTRIBUTORS SHALL HAVE ANY LIABILITY FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING WITHOUT LIMITATION LOST
+PROFITS), HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+OUT OF THE USE OR DISTRIBUTION OF THE PROGRAM OR THE EXERCISE OF ANY RIGHTS
+GRANTED HEREUNDER, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
+
+7. GENERAL
+
+If any provision of this Agreement is invalid or unenforceable under applicable
+law, it shall not affect the validity or enforceability of the remainder of the
+terms of this Agreement, and without further action by the parties hereto, such
+provision shall be reformed to the minimum extent necessary to make such
+provision valid and enforceable.
+
+If Recipient institutes patent litigation against a Contributor with respect to
+a patent applicable to software (including a cross-claim or counterclaim in a
+lawsuit), then any patent licenses granted by that Contributor to such Recipient
+under this Agreement shall terminate as of the date such litigation is filed. In
+addition, if Recipient institutes patent litigation against any entity
+(including a cross-claim or counterclaim in a lawsuit) alleging that the Program
+itself (excluding combinations of the Program with other software or hardware)
+infringes such Recipient's patent(s), then such Recipient's rights granted under
+Section 2(b) shall terminate as of the date such litigation is filed.
+
+All Recipient's rights under this Agreement shall terminate if it fails to
+comply with any of the material terms or conditions of this Agreement and does
+not cure such failure in a reasonable period of time after becoming aware of
+such noncompliance. If all Recipient's rights under this Agreement terminate,
+Recipient agrees to cease use and distribution of the Program as soon as
+reasonably practicable. However, Recipient's obligations under this Agreement
+and any licenses granted by Recipient relating to the Program shall continue and
+survive.
+
+Everyone is permitted to copy and distribute copies of this Agreement, but in
+order to avoid inconsistency the Agreement is copyrighted and may only be
+modified in the following manner. The Agreement Steward reserves the right to
+publish new versions (including revisions) of this Agreement from time to time.
+No one other than the Agreement Steward has the right to modify this Agreement.
+IBM is the initial Agreement Steward. IBM may assign the responsibility to serve
+as the Agreement Steward to a suitable separate entity. Each new version of the
+Agreement will be given a distinguishing version number. The Program (including
+Contributions) may always be distributed subject to the version of the Agreement
+under which it was received. In addition, after a new version of the Agreement
+is published, Contributor may elect to distribute the Program (including its
+Contributions) under the new version. Except as expressly stated in Sections
+2(a) and 2(b) above, Recipient receives no rights or licenses to the
+intellectual property of any Contributor under this Agreement, whether
+expressly, by implication, estoppel or otherwise. All rights in the Program not
+expressly granted under this Agreement are reserved.
+
+This Agreement is governed by the laws of the State of New York and the
+intellectual property laws of the United States of America. No party to this
+Agreement will bring a legal action under this Agreement more than one year
+after the cause of action arose. Each party waives its rights to a jury trial in
+any resulting litigation.
+
+Special exception for LZMA compression module
+
+Igor Pavlov and Amir Szekely, the authors of the LZMA compression module for
+NSIS, expressly permit you to statically or dynamically link your code (or bind
+by name) to the files from the LZMA compression module for NSIS without
+subjecting your linked code to the terms of the Common Public license version
+1.0. Any modifications or additions to files from the LZMA compression module
+for NSIS, however, are subject to the terms of the Common Public License version
+1.0.
\ No newline at end of file
diff --git a/x86_64/share/hadoop/httpfs/tomcat/NOTICE b/x86_64/share/hadoop/httpfs/tomcat/NOTICE
new file mode 100644
index 0000000..aaa19b6
--- /dev/null
+++ b/x86_64/share/hadoop/httpfs/tomcat/NOTICE
@@ -0,0 +1,16 @@
+Apache Tomcat
+Copyright 1999-2012 The Apache Software Foundation
+
+This product includes software developed by
+The Apache Software Foundation (http://www.apache.org/).
+
+The Windows Installer is built with the Nullsoft
+Scriptable Install Sysem (NSIS), which is
+open source software. The original software and
+related information is available at
+http://nsis.sourceforge.net.
+
+Java compilation software for JSP pages is provided by Eclipse,
+which is open source software. The original software and
+related information is available at
+http://www.eclipse.org.
diff --git a/x86_64/share/hadoop/httpfs/tomcat/RELEASE-NOTES b/x86_64/share/hadoop/httpfs/tomcat/RELEASE-NOTES
new file mode 100644
index 0000000..a90c104
--- /dev/null
+++ b/x86_64/share/hadoop/httpfs/tomcat/RELEASE-NOTES
@@ -0,0 +1,234 @@
+================================================================================
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+================================================================================
+
+$Id: RELEASE-NOTES 1392285 2012-10-01 11:32:00Z kkolinko $
+
+
+ Apache Tomcat Version 6.0.36
+ Release Notes
+
+
+=============================
+KNOWN ISSUES IN THIS RELEASE:
+=============================
+
+* Dependency Changes
+* JNI Based Applications
+* Bundled APIs
+* Web application reloading and static fields in shared libraries
+* Tomcat on Linux
+* Enabling SSI and CGI Support
+* Security manager URLs
+* Symlinking static resources
+* Enabling invoker servlet
+* Viewing the Tomcat Change Log
+* Cryptographic software notice
+* When all else fails
+
+
+===================
+Dependency Changes:
+===================
+Tomcat 6.0 is designed to run on Java SE 5.0 and later.
+
+In addition, Tomcat 6.0 uses the Eclipse JDT Java compiler for compiling
+JSP pages. This means you no longer need to have the complete
+Java Development Kit (JDK) to run Tomcat, but a Java Runtime Environment
+(JRE) is sufficient. The Eclipse JDT Java compiler is bundled with the
+binary Tomcat distributions. Tomcat can also be configured to use the
+compiler from the JDK to compile JSPs, or any other Java compiler supported
+by Apache Ant.
+
+
+=======================
+JNI Based Applications:
+=======================
+Applications that require native libraries must ensure that the libraries have
+been loaded prior to use. Typically, this is done with a call like:
+
+ static {
+ System.loadLibrary("path-to-library-file");
+ }
+
+in some class. However, the application must also ensure that the library is
+not loaded more than once. If the above code were placed in a class inside
+the web application (i.e. under /WEB-INF/classes or /WEB-INF/lib), and the
+application were reloaded, the loadLibrary() call would be attempted a second
+time.
+
+To avoid this problem, place classes that load native libraries outside of the
+web application, and ensure that the loadLibrary() call is executed only once
+during the lifetime of a particular JVM.
+
+
+=============
+Bundled APIs:
+=============
+A standard installation of Tomcat 6.0 makes all of the following APIs available
+for use by web applications (by placing them in "lib"):
+* annotations-api.jar (Annotations package)
+* catalina.jar (Tomcat Catalina implementation)
+* catalina-ant.jar (Tomcat Catalina Ant tasks)
+* catalina-ha.jar (High availability package)
+* catalina-tribes.jar (Group communication)
+* ecj-@JDT_VERSION@.jar (Eclipse JDT Java compiler)
+* el-api.jar (EL 2.1 API)
+* jasper.jar (Jasper 2 Compiler and Runtime)
+* jasper-el.jar (Jasper 2 EL implementation)
+* jsp-api.jar (JSP 2.1 API)
+* servlet-api.jar (Servlet 2.5 API)
+* tomcat-coyote.jar (Tomcat connectors and utility classes)
+* tomcat-dbcp.jar (package renamed database connection pool based on Commons DBCP)
+
+You can make additional APIs available to all of your web applications by
+putting unpacked classes into a "classes" directory (not created by default),
+or by placing them in JAR files in the "lib" directory.
+
+To override the XML parser implementation or interfaces, use the endorsed
+mechanism of the JVM. The default configuration defines JARs located in
+"endorsed" as endorsed.
+
+
+================================================================
+Web application reloading and static fields in shared libraries:
+================================================================
+Some shared libraries (many are part of the JDK) keep references to objects
+instantiated by the web application. To avoid class loading related problems
+(ClassCastExceptions, messages indicating that the classloader
+is stopped, etc.), the shared libraries state should be reinitialized.
+
+Something which might help is to avoid putting classes which would be
+referenced by a shared static field in the web application classloader,
+and putting them in the shared classloader instead (JARs should be put in the
+"lib" folder, and classes should be put in the "classes" folder).
+
+
+================
+Tomcat on Linux:
+================
+GLIBC 2.2 / Linux 2.4 users should define an environment variable:
+export LD_ASSUME_KERNEL=2.2.5
+
+Redhat Linux 9.0 users should use the following setting to avoid
+stability problems:
+export LD_ASSUME_KERNEL=2.4.1
+
+There are some Linux bugs reported against the NIO sendfile behavior, make sure you
+have a JDK that is up to date, or disable sendfile behavior in the Connector.
+6427312: (fc) FileChannel.transferTo() throws IOException "system call interrupted"
+5103988: (fc) FileChannel.transferTo should return -1 for EAGAIN instead throws IOException
+6253145: (fc) FileChannel.transferTo on Linux fails when going beyond 2GB boundary
+6470086: (fc) FileChannel.transferTo(2147483647, 1, channel) cause "Value too large" exception
+
+
+=============================
+Enabling SSI and CGI Support:
+=============================
+Because of the security risks associated with CGI and SSI available
+to web applications, these features are disabled by default.
+
+To enable and configure CGI support, please see the cgi-howto.html page.
+
+To enable and configue SSI support, please see the ssi-howto.html page.
+
+
+======================
+Security manager URLs:
+======================
+In order to grant security permissions to JARs located inside the
+web application repository, use URLs of of the following format
+in your policy file:
+
+file:${catalina.base}/webapps/examples/WEB-INF/lib/driver.jar
+
+
+============================
+Symlinking static resources:
+============================
+By default, Unix symlinks will not work when used in a web application to link
+resources located outside the web application root directory.
+
+This behavior is optional, and the "allowLinking" flag may be used to disable
+the check.
+
+
+=========================
+Enabling invoker servlet:
+=========================
+Starting with Tomcat 4.1.12, the invoker servlet is no longer available by
+default in all webapps. Enabling it for all webapps is possible by editing
+$CATALINA_HOME/conf/web.xml to uncomment the "/servlet/*" servlet-mapping
+definition.
+
+Using the invoker servlet in a production environment is not recommended and
+is unsupported. More details are available on the Tomcat FAQ at
+http://tomcat.apache.org/faq/misc.html#invoker.
+
+
+==============================
+Viewing the Tomcat Change Log:
+==============================
+See changelog.html in this directory.
+
+
+============================================
+Multi-byte charset handling bug in Java 1.5:
+============================================
+Public versions of Sun/Oracle Java 1.5 are known to have a nasty bug in
+implementation of Charset.decode() method for certain character sets.
+
+For details, test and a list of affected character sets see:
+
+http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6196991
+https://issues.apache.org/bugzilla/show_bug.cgi?id=52579
+
+The UTF-8 charset is not affected by this issue.
+
+
+=============================
+Cryptographic software notice
+=============================
+This distribution includes cryptographic software. The country in
+which you currently reside may have restrictions on the import,
+possession, use, and/or re-export to another country, of
+encryption software. BEFORE using any encryption software, please
+check your country's laws, regulations and policies concerning the
+import, possession, or use, and re-export of encryption software, to
+see if this is permitted. See for more
+information.
+
+The U.S. Government Department of Commerce, Bureau of Industry and
+Security (BIS), has classified this software as Export Commodity
+Control Number (ECCN) 5D002.C.1, which includes information security
+software using or performing cryptographic functions with asymmetric
+algorithms. The form and manner of this Apache Software Foundation
+distribution makes it eligible for export under the License Exception
+ENC Technology Software Unrestricted (TSU) exception (see the BIS
+Export Administration Regulations, Section 740.13) for both object
+code and source code.
+
+The following provides more details on the included cryptographic
+software:
+ - Tomcat includes code designed to work with JSSE
+ - Tomcat includes code designed to work with OpenSSL
+
+
+====================
+When all else fails:
+====================
+See the FAQ
+http://tomcat.apache.org/faq/
diff --git a/x86_64/share/hadoop/httpfs/tomcat/RUNNING.txt b/x86_64/share/hadoop/httpfs/tomcat/RUNNING.txt
new file mode 100644
index 0000000..2e56cf9
--- /dev/null
+++ b/x86_64/share/hadoop/httpfs/tomcat/RUNNING.txt
@@ -0,0 +1,454 @@
+================================================================================
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+================================================================================
+
+$Id: RUNNING.txt 1348600 2012-06-10 14:11:54Z kkolinko $
+
+ ===================================================
+ Running The Apache Tomcat 6.0 Servlet/JSP Container
+ ===================================================
+
+Apache Tomcat 6.0 requires a Java Standard Edition Runtime
+Environment (JRE) version 5.0 or later.
+
+=============================
+Running With JRE 5.0 Or Later
+=============================
+
+(1) Download and Install a Java SE Runtime Environment (JRE)
+
+(1.1) Download a Java SE Runtime Environment (JRE),
+ release version 5.0 or later, from
+ http://www.oracle.com/technetwork/java/javase/downloads/index.html
+
+(1.2) Install the JRE according to the instructions included with the
+ release.
+
+ You may also use a full Java Development Kit (JDK) rather than just
+ a JRE.
+
+
+(2) Download and Install Apache Tomcat
+
+(2.1) Download a binary distribution of Tomcat from:
+
+ http://tomcat.apache.org/
+
+(2.2) Unpack the binary distribution so that it resides in its own
+ directory (conventionally named "apache-tomcat-[version]").
+
+ For the purposes of the remainder of this document, the name
+ "CATALINA_HOME" is used to refer to the full pathname of that
+ directory.
+
+NOTE: As an alternative to downloading a binary distribution, you can
+create your own from the Tomcat source code, as described in
+"BUILDING.txt". You can either
+
+ a) Do the full "release" build and find the created distribution in the
+ "output/release" directory and then proceed with unpacking as above, or
+
+ b) Do a simple build and use the "output/build" directory as
+ "CATALINA_HOME". Be warned that there are some differences between the
+ contents of the "output/build" directory and a full "release"
+ distribution.
+
+
+(3) Configure Environment Variables
+
+Tomcat is a Java application and does not use environment variables. The
+variables are used by the Tomcat startup scripts. The scripts use the variables
+to prepare the command that starts Tomcat.
+
+(3.1) Set CATALINA_HOME (required) and CATALINA_BASE (optional)
+
+The CATALINA_HOME and CATALINA_BASE environment variables are used to
+specify the location of Apache Tomcat and the location of its active
+configuration, respectively.
+
+The CATALINA_HOME environment variable should be set as defined in (2.2)
+above. The Tomcat startup scripts have some logic to set this variable
+automatically if it is absent (based on the location of the script in
+Unixes and on the current directory in Windows), but this logic might not work
+in all circumstances.
+
+The CATALINA_BASE environment variable is optional and is further described
+in the "Multiple Tomcat Instances" section below. If it is absent, it defaults
+to be equal to CATALINA_HOME.
+
+
+(3.2) Set JRE_HOME or JAVA_HOME (required)
+
+The JRE_HOME variable is used to specify location of a JRE that is used to
+start Tomcat.
+
+The JAVA_HOME variable is used to specify location of a JDK. It is used instead
+of JRE_HOME.
+
+Using JAVA_HOME provides access to certain additional startup options that
+are not allowed when JRE_HOME is used.
+
+If both JRE_HOME and JAVA_HOME are specified, JRE_HOME is used.
+
+
+(3.3) Other variables (optional)
+
+Other environment variables exist, besides the four described above.
+See the comments at the top of catalina.bat or catalina.sh scripts for
+the list and a description of each of them.
+
+One frequently used variable is CATALINA_OPTS. It allows specification of
+additional options for the java command that starts Tomcat.
+
+See the Java documentation for the options that affect the Java Runtime
+Environment.
+
+See the "System Properties" page in the Tomcat Configuration Reference for
+the system properties that are specific to Tomcat.
+
+A similar variable is JAVA_OPTS. It is used less frequently. It allows
+specification of options that are used both to start and to stop Tomcat as well
+as for other commands.
+
+Do not use JAVA_OPTS to specify memory limits. You do not need much memory
+for a small process that is used to stop Tomcat. Those settings belong to
+CATALINA_OPTS.
+
+Another frequently used variable is CATALINA_PID (on *nix platforms only). It
+specifies the location of the file where process id of the forked Tomcat java
+process will be written. This setting is optional. It will enable the
+following features:
+
+ - better protection against duplicate start attempts and
+ - allows forceful termination of Tomcat process when it does not react to
+ the standard shutdown command.
+
+
+(3.4) setenv script (optional)
+
+Apart from CATALINA_HOME and CATALINA_BASE, all environment variables can
+be specified in the "setenv" script.
+
+The script is named setenv.bat (Windows) or setenv.sh (*nix). It can be
+placed either into CATALINA_BASE/bin or into CATALINA_HOME/bin. The file
+has to be readable.
+
+By default the setenv script file is absent. If the setenv script is
+present both in CATALINA_BASE and in CATALINA_HOME, the one in
+CATALINA_BASE is used.
+
+For example, to configure the JRE_HOME and CATALINA_PID variables you can
+create the following script file:
+
+On Windows, %CATALINA_BASE%\bin\setenv.bat:
+
+ set "JRE_HOME=%ProgramFiles%\Java\jre6"
+ exit /b 0
+
+On Unix, $CATALINA_BASE/bin/setenv.sh:
+
+ JRE_HOME=/usr/java/latest
+ CATALINA_PID="$CATALINA_BASE/tomcat.pid"
+
+You cannot configure CATALINA_HOME and CATALINA_BASE variables in the
+setenv script, because they are used to find that file.
+
+
+(4) Start Up Tomcat
+
+(4.1) Tomcat can be started by executing one of the following commands:
+
+ %CATALINA_HOME%\bin\startup.bat (Windows)
+
+ $CATALINA_HOME/bin/startup.sh (Unix)
+
+ or
+
+ %CATALINA_HOME%\bin\catalina.bat start (Windows)
+
+ $CATALINA_HOME/bin/catalina.sh start (Unix)
+
+(4.2) After startup, the default web applications included with Tomcat will be
+ available by visiting:
+
+ http://localhost:8080/
+
+(4.3) Further information about configuring and running Tomcat can be found in
+ the documentation included here, as well as on the Tomcat web site:
+
+ http://tomcat.apache.org/
+
+
+(5) Shut Down Tomcat
+
+(5.1) Tomcat can be shut down by executing one of the following commands:
+
+ %CATALINA_HOME%\bin\shutdown.bat (Windows)
+
+ $CATALINA_HOME/bin/shutdown.sh (Unix)
+
+ or
+
+ %CATALINA_HOME%\bin\catalina.bat stop (Windows)
+
+ $CATALINA_HOME/bin/catalina.sh stop (Unix)
+
+==================================================
+Advanced Configuration - Multiple Tomcat Instances
+==================================================
+
+In many circumstances, it is desirable to have a single copy of a Tomcat
+binary distribution shared among multiple users on the same server. To make
+this possible, you can set the CATALINA_BASE environment variable to the
+directory that contains the files for your 'personal' Tomcat instance.
+
+When running with a separate CATALINA_HOME and CATALINA_BASE, the files
+and directories are split as following:
+
+In CATALINA_BASE:
+
+ * bin - Only the following files:
+
+ * setenv.sh (*nix) or setenv.bat (Windows),
+ * tomcat-juli.jar
+
+ The setenv scripts were described above. The tomcat-juli library
+ is documented in the Logging chapter in the User Guide.
+
+ * conf - Server configuration files (including server.xml)
+
+ * lib - Libraries and classes, as explained below
+
+ * logs - Log and output files
+
+ * webapps - Automatically loaded web applications
+
+ * work - Temporary working directories for web applications
+
+ * temp - Directory used by the JVM for temporary files (java.io.tmpdir)
+
+
+In CATALINA_HOME:
+
+ * bin - Startup and shutdown scripts
+
+ The following files will be used only if they are absent in
+ CATALINA_BASE/bin:
+
+ setenv.sh (*nix), setenv.bat (Windows), tomcat-juli.jar
+
+ * lib - Libraries and classes, as explained below
+
+ * endorsed - Libraries that override standard "Endorsed Standards"
+ libraries provided by JRE. See Classloading documentation
+ in the User Guide for details.
+
+ By default this "endorsed" directory is absent.
+
+In the default configuration the JAR libraries and classes both in
+CATALINA_BASE/lib and in CATALINA_HOME/lib will be added to the common
+classpath, but the ones in CATALINA_BASE will be added first and thus will
+be searched first.
+
+The idea is that you may leave the standard Tomcat libraries in
+CATALINA_HOME/lib and add other ones such as database drivers into
+CATALINA_BASE/lib.
+
+In general it is advised to never share libraries between web applications,
+but put them into WEB-INF/lib directories inside the applications. See
+Classloading documentation in the User Guide for details.
+
+
+It might be useful to note that the values of CATALINA_HOME and
+CATALINA_BASE can be referenced in the XML configuration files processed
+by Tomcat as ${catalina.home} and ${catalina.base} respectively.
+
+For example, the standard manager web application can be kept in
+CATALINA_HOME/webapps/manager and loaded into CATALINA_BASE by using
+the following trick:
+
+ * Copy the CATALINA_HOME/webapps/manager/META-INF/context.xml
+ file as CATALINA_BASE/conf/Catalina/localhost/manager.xml
+
+ * Add docBase attribute as shown below.
+
+The file will look like the following:
+
+
+
+
+
+See Deployer chapter in User Guide and Context and Host chapters in the
+Configuration Reference for more information on contexts and web
+application deployment.
+
+
+================
+Troubleshooting
+================
+
+There are only really 2 things likely to go wrong during the stand-alone
+Tomcat install:
+
+(1) The most common hiccup is when another web server (or any process for that
+ matter) has laid claim to port 8080. This is the default HTTP port that
+ Tomcat attempts to bind to at startup. To change this, open the file:
+
+ $CATALINA_HOME/conf/server.xml
+
+ and search for '8080'. Change it to a port that isn't in use, and is
+ greater than 1024, as ports less than or equal to 1024 require superuser
+ access to bind under UNIX.
+
+ Restart Tomcat and you're in business. Be sure that you replace the "8080"
+ in the URL you're using to access Tomcat. For example, if you change the
+ port to 1977, you would request the URL http://localhost:1977/ in your
+ browser.
+
+(2) The 'localhost' machine isn't found. This could happen if you're behind a
+ proxy. If that's the case, make sure the proxy configuration for your
+ browser knows that you shouldn't be going through the proxy to access the
+ "localhost".
+
+ In Firefox, this is under Tools/Preferences -> Advanced/Network ->
+ Connection -> Settings..., and in Internet Explorer it is Tools ->
+ Internet Options -> Connections -> LAN Settings.
+
+
+====================
+Optional Components
+====================
+
+The following optional components may be included with the Apache Tomcat binary
+distribution. If they are not included, you can install them separately.
+
+ 1. Apache Tomcat Native library
+
+ 2. Apache Commons Daemon service launcher
+
+Both of them are implemented in C language and as such have to be compiled
+into binary code. The binary code will be specific for a platform and CPU
+architecture and it must match the Java Runtime Environment executables
+that will be used to launch Tomcat.
+
+The Windows-specific binary distributions of Apache Tomcat include binary
+files for these components. On other platforms you would have to look for
+binary versions elsewhere or compile them yourself.
+
+If you are new to Tomcat, do not bother with these components to start with.
+If you do use them, do not forget to read their documentation.
+
+
+Apache Tomcat Native library
+-----------------------------
+
+It is a library that allows to use the "Apr" variant of HTTP and AJP
+protocol connectors in Apache Tomcat. It is built around OpenSSL and Apache
+Portable Runtime (APR) libraries. Those are the same libraries as used by
+Apache HTTPD Server project.
+
+This feature was especially important in the old days when Java performance
+was poor. It is less important nowadays, but it is still used and respected
+by many. See Tomcat documentation for more details.
+
+For further reading:
+
+ - Apache Tomcat documentation
+
+ * Documentation for APR/Native library in the Tomcat User's Guide
+
+ http://tomcat.apache.org/tomcat-6.0-doc/apr.html
+
+ * Documentation for the HTTP and AJP protocol connectors in the Tomcat
+ Configuration Reference
+
+ http://tomcat.apache.org/tomcat-6.0-doc/config/http.html
+
+ http://tomcat.apache.org/tomcat-6.0-doc/config/ajp.html
+
+ - Apache Tomcat Native project home
+
+ http://tomcat.apache.org/native-doc/
+
+ - Other projects
+
+ * OpenSSL
+
+ http://openssl.org/
+
+ * Apache Portable Runtime
+
+ http://apr.apache.org/
+
+ * Apache HTTP Server
+
+ http://httpd.apache.org/
+
+To disable Apache Tomcat Native library:
+
+ - To disable Apache Tomcat Native library when it is installed, or
+ - To remove the warning that is logged during Tomcat startup when the
+ library is not installed:
+
+ Edit the "conf/server.xml" file and remove "AprLifecycleListener" from
+ it.
+
+The binary file of Apache Tomcat Native library is usually named
+
+ - "tcnative-1.dll" on Windows
+ - "libtcnative-1.so" on *nix systems
+
+
+Apache Commons Daemon
+----------------------
+
+Apache Commons Daemon project provides wrappers that can be used to
+install Apache Tomcat as a service on Windows or as a daemon on *nix
+systems.
+
+The Windows-specific implementation of Apache Commons Daemon is called
+"procrun". The *nix-specific one is called "jsvc".
+
+For further reading:
+
+ - Apache Commons Daemon project
+
+ http://commons.apache.org/daemon/
+
+ - Apache Tomcat documentation
+
+ * Installing Apache Tomcat
+
+ http://tomcat.apache.org/tomcat-6.0-doc/setup.html
+
+ * Windows service HOW-TO
+
+ http://tomcat.apache.org/tomcat-6.0-doc/windows-service-howto.html
+
+The binary files of Apache Commons Daemon in Apache Tomcat distributions
+for Windows are named:
+
+ - "tomcat6.exe"
+ - "tomcat6w.exe"
+
+These files are renamed copies of "prunsrv.exe" and "prunmgr.exe" from
+Apache Commons Daemon distribution. The file names have a meaning: they are
+used as the service name to register the service in Windows, as well as the
+key name to store distinct configuration for this installation of
+"procrun". If you would like to install several instances of Tomcat 6.0
+in parallel, you have to further rename those files, using the same naming
+scheme.
diff --git a/x86_64/share/hadoop/httpfs/tomcat/bin/bootstrap.jar b/x86_64/share/hadoop/httpfs/tomcat/bin/bootstrap.jar
new file mode 100644
index 0000000..eb8c330
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/bin/bootstrap.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/bin/catalina-tasks.xml b/x86_64/share/hadoop/httpfs/tomcat/bin/catalina-tasks.xml
new file mode 100644
index 0000000..8b802ce
--- /dev/null
+++ b/x86_64/share/hadoop/httpfs/tomcat/bin/catalina-tasks.xml
@@ -0,0 +1,58 @@
+
+
+
+
+
+ Catalina Ant Manager, JMX and JSPC Tasks
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/x86_64/share/hadoop/httpfs/tomcat/bin/catalina.bat b/x86_64/share/hadoop/httpfs/tomcat/bin/catalina.bat
new file mode 100644
index 0000000..07de2dc
--- /dev/null
+++ b/x86_64/share/hadoop/httpfs/tomcat/bin/catalina.bat
@@ -0,0 +1,286 @@
+@echo off
+rem Licensed to the Apache Software Foundation (ASF) under one or more
+rem contributor license agreements. See the NOTICE file distributed with
+rem this work for additional information regarding copyright ownership.
+rem The ASF licenses this file to You under the Apache License, Version 2.0
+rem (the "License"); you may not use this file except in compliance with
+rem the License. You may obtain a copy of the License at
+rem
+rem http://www.apache.org/licenses/LICENSE-2.0
+rem
+rem Unless required by applicable law or agreed to in writing, software
+rem distributed under the License is distributed on an "AS IS" BASIS,
+rem WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+rem See the License for the specific language governing permissions and
+rem limitations under the License.
+
+if "%OS%" == "Windows_NT" setlocal
+rem ---------------------------------------------------------------------------
+rem Start/Stop Script for the CATALINA Server
+rem
+rem Environment Variable Prerequisites
+rem
+rem CATALINA_HOME May point at your Catalina "build" directory.
+rem
+rem CATALINA_BASE (Optional) Base directory for resolving dynamic portions
+rem of a Catalina installation. If not present, resolves to
+rem the same directory that CATALINA_HOME points to.
+rem
+rem CATALINA_OPTS (Optional) Java runtime options used when the "start",
+rem or "run" command is executed.
+rem
+rem CATALINA_TMPDIR (Optional) Directory path location of temporary directory
+rem the JVM should use (java.io.tmpdir). Defaults to
+rem %CATALINA_BASE%\temp.
+rem
+rem JAVA_HOME Must point at your Java Development Kit installation.
+rem Required to run the with the "debug" argument.
+rem
+rem JRE_HOME Must point at your Java Runtime installation.
+rem Defaults to JAVA_HOME if empty.
+rem
+rem JAVA_OPTS (Optional) Java runtime options used when the "start",
+rem "stop", or "run" command is executed.
+rem
+rem JAVA_ENDORSED_DIRS (Optional) Lists of of semi-colon separated directories
+rem containing some jars in order to allow replacement of APIs
+rem created outside of the JCP (i.e. DOM and SAX from W3C).
+rem It can also be used to update the XML parser implementation.
+rem Defaults to $CATALINA_HOME/endorsed.
+rem
+rem JPDA_TRANSPORT (Optional) JPDA transport used when the "jpda start"
+rem command is executed. The default is "dt_socket".
+rem
+rem JPDA_ADDRESS (Optional) Java runtime options used when the "jpda start"
+rem command is executed. The default is 8000.
+rem
+rem JPDA_SUSPEND (Optional) Java runtime options used when the "jpda start"
+rem command is executed. Specifies whether JVM should suspend
+rem execution immediately after startup. Default is "n".
+rem
+rem JPDA_OPTS (Optional) Java runtime options used when the "jpda start"
+rem command is executed. If used, JPDA_TRANSPORT, JPDA_ADDRESS,
+rem and JPDA_SUSPEND are ignored. Thus, all required jpda
+rem options MUST be specified. The default is:
+rem
+rem -agentlib:jdwp=transport=%JPDA_TRANSPORT%,
+rem address=%JPDA_ADDRESS%,server=y,suspend=%JPDA_SUSPEND%
+rem
+rem LOGGING_CONFIG (Optional) Override Tomcat's logging config file
+rem Example (all one line)
+rem set LOGGING_CONFIG="-Djava.util.logging.config.file=%CATALINA_BASE%\conf\logging.properties"
+rem
+rem LOGGING_MANAGER (Optional) Override Tomcat's logging manager
+rem Example (all one line)
+rem set LOGGING_MANAGER="-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager"
+rem
+rem TITLE (Optional) Specify the title of Tomcat window. The default
+rem TITLE is Tomcat if it's not specified.
+rem Example (all one line)
+rem set TITLE=Tomcat.Cluster#1.Server#1 [%DATE% %TIME%]
+rem
+rem
+rem
+rem $Id: catalina.bat 1146097 2011-07-13 15:25:05Z markt $
+rem ---------------------------------------------------------------------------
+
+rem Guess CATALINA_HOME if not defined
+set "CURRENT_DIR=%cd%"
+if not "%CATALINA_HOME%" == "" goto gotHome
+set "CATALINA_HOME=%CURRENT_DIR%"
+if exist "%CATALINA_HOME%\bin\catalina.bat" goto okHome
+cd ..
+set "CATALINA_HOME=%cd%"
+cd "%CURRENT_DIR%"
+:gotHome
+if exist "%CATALINA_HOME%\bin\catalina.bat" goto okHome
+echo The CATALINA_HOME environment variable is not defined correctly
+echo This environment variable is needed to run this program
+goto end
+:okHome
+
+rem Copy CATALINA_BASE from CATALINA_HOME if not defined
+if not "%CATALINA_BASE%" == "" goto gotBase
+set "CATALINA_BASE=%CATALINA_HOME%"
+:gotBase
+
+rem Ensure that any user defined CLASSPATH variables are not used on startup,
+rem but allow them to be specified in setenv.bat, in rare case when it is needed.
+set CLASSPATH=
+
+rem Get standard environment variables
+if not exist "%CATALINA_BASE%\bin\setenv.bat" goto checkSetenvHome
+call "%CATALINA_BASE%\bin\setenv.bat"
+goto setenvDone
+:checkSetenvHome
+if exist "%CATALINA_HOME%\bin\setenv.bat" call "%CATALINA_HOME%\bin\setenv.bat"
+:setenvDone
+
+rem Get standard Java environment variables
+if exist "%CATALINA_HOME%\bin\setclasspath.bat" goto okSetclasspath
+echo Cannot find "%CATALINA_HOME%\bin\setclasspath.bat"
+echo This file is needed to run this program
+goto end
+:okSetclasspath
+set "BASEDIR=%CATALINA_HOME%"
+call "%CATALINA_HOME%\bin\setclasspath.bat" %1
+if errorlevel 1 goto end
+
+if not "%CATALINA_TMPDIR%" == "" goto gotTmpdir
+set "CATALINA_TMPDIR=%CATALINA_BASE%\temp"
+:gotTmpdir
+
+rem Add tomcat-juli.jar and bootstrap.jar to classpath
+rem tomcat-juli.jar can be over-ridden per instance
+rem Note that there are no quotes as we do not want to introduce random
+rem quotes into the CLASSPATH
+if "%CLASSPATH%" == "" goto emptyClasspath
+set "CLASSPATH=%CLASSPATH%;"
+:emptyClasspath
+if "%CATALINA_BASE%" == "%CATALINA_HOME%" goto juliClasspathHome
+if not exist "%CATALINA_BASE%\bin\tomcat-juli.jar" goto juliClasspathHome
+set "CLASSPATH=%CLASSPATH%%CATALINA_BASE%\bin\tomcat-juli.jar;%CATALINA_HOME%\bin\bootstrap.jar"
+goto juliClasspathDone
+:juliClasspathHome
+set "CLASSPATH=%CLASSPATH%%CATALINA_HOME%\bin\bootstrap.jar"
+:juliClasspathDone
+
+if not "%LOGGING_CONFIG%" == "" goto noJuliConfig
+set LOGGING_CONFIG=-Dnop
+if not exist "%CATALINA_BASE%\conf\logging.properties" goto noJuliConfig
+set LOGGING_CONFIG=-Djava.util.logging.config.file="%CATALINA_BASE%\conf\logging.properties"
+:noJuliConfig
+set JAVA_OPTS=%JAVA_OPTS% %LOGGING_CONFIG%
+
+if not "%LOGGING_MANAGER%" == "" goto noJuliManager
+set LOGGING_MANAGER=-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
+:noJuliManager
+set JAVA_OPTS=%JAVA_OPTS% %LOGGING_MANAGER%
+
+rem ----- Execute The Requested Command ---------------------------------------
+
+echo Using CATALINA_BASE: "%CATALINA_BASE%"
+echo Using CATALINA_HOME: "%CATALINA_HOME%"
+echo Using CATALINA_TMPDIR: "%CATALINA_TMPDIR%"
+if ""%1"" == ""debug"" goto use_jdk
+echo Using JRE_HOME: "%JRE_HOME%"
+goto java_dir_displayed
+:use_jdk
+echo Using JAVA_HOME: "%JAVA_HOME%"
+:java_dir_displayed
+echo Using CLASSPATH: "%CLASSPATH%"
+
+set _EXECJAVA=%_RUNJAVA%
+set MAINCLASS=org.apache.catalina.startup.Bootstrap
+set ACTION=start
+set SECURITY_POLICY_FILE=
+set DEBUG_OPTS=
+set JPDA=
+
+if not ""%1"" == ""jpda"" goto noJpda
+set JPDA=jpda
+if not "%JPDA_TRANSPORT%" == "" goto gotJpdaTransport
+set JPDA_TRANSPORT=dt_socket
+:gotJpdaTransport
+if not "%JPDA_ADDRESS%" == "" goto gotJpdaAddress
+set JPDA_ADDRESS=8000
+:gotJpdaAddress
+if not "%JPDA_SUSPEND%" == "" goto gotJpdaSuspend
+set JPDA_SUSPEND=n
+:gotJpdaSuspend
+if not "%JPDA_OPTS%" == "" goto gotJpdaOpts
+set JPDA_OPTS=-agentlib:jdwp=transport=%JPDA_TRANSPORT%,address=%JPDA_ADDRESS%,server=y,suspend=%JPDA_SUSPEND%
+:gotJpdaOpts
+shift
+:noJpda
+
+if ""%1"" == ""debug"" goto doDebug
+if ""%1"" == ""run"" goto doRun
+if ""%1"" == ""start"" goto doStart
+if ""%1"" == ""stop"" goto doStop
+if ""%1"" == ""version"" goto doVersion
+
+echo Usage: catalina ( commands ... )
+echo commands:
+echo debug Start Catalina in a debugger
+echo debug -security Debug Catalina with a security manager
+echo jpda start Start Catalina under JPDA debugger
+echo run Start Catalina in the current window
+echo run -security Start in the current window with security manager
+echo start Start Catalina in a separate window
+echo start -security Start in a separate window with security manager
+echo stop Stop Catalina
+echo version What version of tomcat are you running?
+goto end
+
+:doDebug
+shift
+set _EXECJAVA=%_RUNJDB%
+set DEBUG_OPTS=-sourcepath "%CATALINA_HOME%\..\..\java"
+if not ""%1"" == ""-security"" goto execCmd
+shift
+echo Using Security Manager
+set "SECURITY_POLICY_FILE=%CATALINA_BASE%\conf\catalina.policy"
+goto execCmd
+
+:doRun
+shift
+if not ""%1"" == ""-security"" goto execCmd
+shift
+echo Using Security Manager
+set "SECURITY_POLICY_FILE=%CATALINA_BASE%\conf\catalina.policy"
+goto execCmd
+
+:doStart
+shift
+if not "%OS%" == "Windows_NT" goto noTitle
+if "%TITLE%" == "" set TITLE=Tomcat
+set _EXECJAVA=start "%TITLE%" %_RUNJAVA%
+goto gotTitle
+:noTitle
+set _EXECJAVA=start %_RUNJAVA%
+:gotTitle
+if not ""%1"" == ""-security"" goto execCmd
+shift
+echo Using Security Manager
+set "SECURITY_POLICY_FILE=%CATALINA_BASE%\conf\catalina.policy"
+goto execCmd
+
+:doStop
+shift
+set ACTION=stop
+set CATALINA_OPTS=
+goto execCmd
+
+:doVersion
+%_EXECJAVA% -classpath "%CATALINA_HOME%\lib\catalina.jar" org.apache.catalina.util.ServerInfo
+goto end
+
+
+:execCmd
+rem Get remaining unshifted command line arguments and save them in the
+set CMD_LINE_ARGS=
+:setArgs
+if ""%1""=="""" goto doneSetArgs
+set CMD_LINE_ARGS=%CMD_LINE_ARGS% %1
+shift
+goto setArgs
+:doneSetArgs
+
+rem Execute Java with the applicable properties
+if not "%JPDA%" == "" goto doJpda
+if not "%SECURITY_POLICY_FILE%" == "" goto doSecurity
+%_EXECJAVA% %JAVA_OPTS% %CATALINA_OPTS% %DEBUG_OPTS% -Djava.endorsed.dirs="%JAVA_ENDORSED_DIRS%" -classpath "%CLASSPATH%" -Dcatalina.base="%CATALINA_BASE%" -Dcatalina.home="%CATALINA_HOME%" -Djava.io.tmpdir="%CATALINA_TMPDIR%" %MAINCLASS% %CMD_LINE_ARGS% %ACTION%
+goto end
+:doSecurity
+%_EXECJAVA% %JAVA_OPTS% %CATALINA_OPTS% %DEBUG_OPTS% -Djava.endorsed.dirs="%JAVA_ENDORSED_DIRS%" -classpath "%CLASSPATH%" -Djava.security.manager -Djava.security.policy=="%SECURITY_POLICY_FILE%" -Dcatalina.base="%CATALINA_BASE%" -Dcatalina.home="%CATALINA_HOME%" -Djava.io.tmpdir="%CATALINA_TMPDIR%" %MAINCLASS% %CMD_LINE_ARGS% %ACTION%
+goto end
+:doJpda
+if not "%SECURITY_POLICY_FILE%" == "" goto doSecurityJpda
+%_EXECJAVA% %JAVA_OPTS% %CATALINA_OPTS% %JPDA_OPTS% %DEBUG_OPTS% -Djava.endorsed.dirs="%JAVA_ENDORSED_DIRS%" -classpath "%CLASSPATH%" -Dcatalina.base="%CATALINA_BASE%" -Dcatalina.home="%CATALINA_HOME%" -Djava.io.tmpdir="%CATALINA_TMPDIR%" %MAINCLASS% %CMD_LINE_ARGS% %ACTION%
+goto end
+:doSecurityJpda
+%_EXECJAVA% %JAVA_OPTS% %CATALINA_OPTS% %JPDA_OPTS% %DEBUG_OPTS% -Djava.endorsed.dirs="%JAVA_ENDORSED_DIRS%" -classpath "%CLASSPATH%" -Djava.security.manager -Djava.security.policy=="%SECURITY_POLICY_FILE%" -Dcatalina.base="%CATALINA_BASE%" -Dcatalina.home="%CATALINA_HOME%" -Djava.io.tmpdir="%CATALINA_TMPDIR%" %MAINCLASS% %CMD_LINE_ARGS% %ACTION%
+goto end
+
+:end
diff --git a/x86_64/share/hadoop/httpfs/tomcat/bin/catalina.sh b/x86_64/share/hadoop/httpfs/tomcat/bin/catalina.sh
new file mode 100755
index 0000000..41cd97c
--- /dev/null
+++ b/x86_64/share/hadoop/httpfs/tomcat/bin/catalina.sh
@@ -0,0 +1,506 @@
+#!/bin/sh
+
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# -----------------------------------------------------------------------------
+# Start/Stop Script for the CATALINA Server
+#
+# Environment Variable Prerequisites
+#
+# CATALINA_HOME May point at your Catalina "build" directory.
+#
+# CATALINA_BASE (Optional) Base directory for resolving dynamic portions
+# of a Catalina installation. If not present, resolves to
+# the same directory that CATALINA_HOME points to.
+#
+# CATALINA_OUT (Optional) Full path to a file where stdout and stderr
+# will be redirected.
+# Default is $CATALINA_BASE/logs/catalina.out
+#
+# CATALINA_OPTS (Optional) Java runtime options used when the "start",
+# or "run" command is executed.
+#
+# CATALINA_TMPDIR (Optional) Directory path location of temporary directory
+# the JVM should use (java.io.tmpdir). Defaults to
+# $CATALINA_BASE/temp.
+#
+# JAVA_HOME Must point at your Java Development Kit installation.
+# Required to run the with the "debug" argument.
+#
+# JRE_HOME Must point at your Java Development Kit installation.
+# Defaults to JAVA_HOME if empty.
+#
+# JAVA_OPTS (Optional) Java runtime options used when the "start",
+# "stop", or "run" command is executed.
+#
+# JAVA_ENDORSED_DIRS (Optional) Lists of of colon separated directories
+# containing some jars in order to allow replacement of APIs
+# created outside of the JCP (i.e. DOM and SAX from W3C).
+# It can also be used to update the XML parser implementation.
+# Defaults to $CATALINA_HOME/endorsed.
+#
+# JPDA_TRANSPORT (Optional) JPDA transport used when the "jpda start"
+# command is executed. The default is "dt_socket".
+#
+# JPDA_ADDRESS (Optional) Java runtime options used when the "jpda start"
+# command is executed. The default is 8000.
+#
+# JPDA_SUSPEND (Optional) Java runtime options used when the "jpda start"
+# command is executed. Specifies whether JVM should suspend
+# execution immediately after startup. Default is "n".
+#
+# JPDA_OPTS (Optional) Java runtime options used when the "jpda start"
+# command is executed. If used, JPDA_TRANSPORT, JPDA_ADDRESS,
+# and JPDA_SUSPEND are ignored. Thus, all required jpda
+# options MUST be specified. The default is:
+#
+# -agentlib:jdwp=transport=$JPDA_TRANSPORT,
+# address=$JPDA_ADDRESS,server=y,suspend=$JPDA_SUSPEND
+#
+# CATALINA_PID (Optional) Path of the file which should contains the pid
+# of catalina startup java process, when start (fork) is used
+#
+# LOGGING_CONFIG (Optional) Override Tomcat's logging config file
+# Example (all one line)
+# LOGGING_CONFIG="-Djava.util.logging.config.file=$CATALINA_BASE/conf/logging.properties"
+#
+# LOGGING_MANAGER (Optional) Override Tomcat's logging manager
+# Example (all one line)
+# LOGGING_MANAGER="-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager"
+#
+# $Id: catalina.sh 1146097 2011-07-13 15:25:05Z markt $
+# -----------------------------------------------------------------------------
+
+# OS specific support. $var _must_ be set to either true or false.
+cygwin=false
+os400=false
+darwin=false
+case "`uname`" in
+CYGWIN*) cygwin=true;;
+OS400*) os400=true;;
+Darwin*) darwin=true;;
+esac
+
+# resolve links - $0 may be a softlink
+PRG="$0"
+
+while [ -h "$PRG" ]; do
+ ls=`ls -ld "$PRG"`
+ link=`expr "$ls" : '.*-> \(.*\)$'`
+ if expr "$link" : '/.*' > /dev/null; then
+ PRG="$link"
+ else
+ PRG=`dirname "$PRG"`/"$link"
+ fi
+done
+
+# Get standard environment variables
+PRGDIR=`dirname "$PRG"`
+
+# Only set CATALINA_HOME if not already set
+[ -z "$CATALINA_HOME" ] && CATALINA_HOME=`cd "$PRGDIR/.." >/dev/null; pwd`
+
+# Copy CATALINA_BASE from CATALINA_HOME if not already set
+[ -z "$CATALINA_BASE" ] && CATALINA_BASE="$CATALINA_HOME"
+
+# Ensure that any user defined CLASSPATH variables are not used on startup,
+# but allow them to be specified in setenv.sh, in rare case when it is needed.
+CLASSPATH=
+
+if [ -r "$CATALINA_BASE/bin/setenv.sh" ]; then
+ . "$CATALINA_BASE/bin/setenv.sh"
+elif [ -r "$CATALINA_HOME/bin/setenv.sh" ]; then
+ . "$CATALINA_HOME/bin/setenv.sh"
+fi
+
+# For Cygwin, ensure paths are in UNIX format before anything is touched
+if $cygwin; then
+ [ -n "$JAVA_HOME" ] && JAVA_HOME=`cygpath --unix "$JAVA_HOME"`
+ [ -n "$JRE_HOME" ] && JRE_HOME=`cygpath --unix "$JRE_HOME"`
+ [ -n "$CATALINA_HOME" ] && CATALINA_HOME=`cygpath --unix "$CATALINA_HOME"`
+ [ -n "$CATALINA_BASE" ] && CATALINA_BASE=`cygpath --unix "$CATALINA_BASE"`
+ [ -n "$CLASSPATH" ] && CLASSPATH=`cygpath --path --unix "$CLASSPATH"`
+fi
+
+# For OS400
+if $os400; then
+ # Set job priority to standard for interactive (interactive - 6) by using
+ # the interactive priority - 6, the helper threads that respond to requests
+ # will be running at the same priority as interactive jobs.
+ COMMAND='chgjob job('$JOBNAME') runpty(6)'
+ system $COMMAND
+
+ # Enable multi threading
+ export QIBM_MULTI_THREADED=Y
+fi
+
+# Get standard Java environment variables
+if $os400; then
+ # -r will Only work on the os400 if the files are:
+ # 1. owned by the user
+ # 2. owned by the PRIMARY group of the user
+ # this will not work if the user belongs in secondary groups
+ BASEDIR="$CATALINA_HOME"
+ . "$CATALINA_HOME"/bin/setclasspath.sh
+else
+ if [ -r "$CATALINA_HOME"/bin/setclasspath.sh ]; then
+ BASEDIR="$CATALINA_HOME"
+ . "$CATALINA_HOME"/bin/setclasspath.sh
+ else
+ echo "Cannot find $CATALINA_HOME/bin/setclasspath.sh"
+ echo "This file is needed to run this program"
+ exit 1
+ fi
+fi
+
+if [ -z "$CATALINA_BASE" ] ; then
+ CATALINA_BASE="$CATALINA_HOME"
+fi
+
+# Add tomcat-juli.jar and bootstrap.jar to classpath
+# tomcat-juli.jar can be over-ridden per instance
+if [ ! -z "$CLASSPATH" ] ; then
+ CLASSPATH="$CLASSPATH":
+fi
+if [ "$CATALINA_BASE" != "$CATALINA_HOME" ] && [ -r "$CATALINA_BASE/bin/tomcat-juli.jar" ] ; then
+ CLASSPATH="$CLASSPATH""$CATALINA_BASE"/bin/tomcat-juli.jar:"$CATALINA_HOME"/bin/bootstrap.jar
+else
+ CLASSPATH="$CLASSPATH""$CATALINA_HOME"/bin/bootstrap.jar
+fi
+
+if [ -z "$CATALINA_OUT" ] ; then
+ CATALINA_OUT="$CATALINA_BASE"/logs/catalina.out
+fi
+
+if [ -z "$CATALINA_TMPDIR" ] ; then
+ # Define the java.io.tmpdir to use for Catalina
+ CATALINA_TMPDIR="$CATALINA_BASE"/temp
+fi
+
+# Bugzilla 37848: When no TTY is available, don't output to console
+have_tty=0
+if [ "`tty`" != "not a tty" ]; then
+ have_tty=1
+fi
+
+# For Cygwin, switch paths to Windows format before running java
+if $cygwin; then
+ JAVA_HOME=`cygpath --absolute --windows "$JAVA_HOME"`
+ JRE_HOME=`cygpath --absolute --windows "$JRE_HOME"`
+ CATALINA_HOME=`cygpath --absolute --windows "$CATALINA_HOME"`
+ CATALINA_BASE=`cygpath --absolute --windows "$CATALINA_BASE"`
+ CATALINA_TMPDIR=`cygpath --absolute --windows "$CATALINA_TMPDIR"`
+ CLASSPATH=`cygpath --path --windows "$CLASSPATH"`
+ JAVA_ENDORSED_DIRS=`cygpath --path --windows "$JAVA_ENDORSED_DIRS"`
+fi
+
+# Set juli LogManager config file if it is present and an override has not been issued
+if [ -z "$LOGGING_CONFIG" ]; then
+ if [ -r "$CATALINA_BASE"/conf/logging.properties ]; then
+ LOGGING_CONFIG="-Djava.util.logging.config.file=$CATALINA_BASE/conf/logging.properties"
+ else
+ # Bugzilla 45585
+ LOGGING_CONFIG="-Dnop"
+ fi
+fi
+
+if [ -z "$LOGGING_MANAGER" ]; then
+ JAVA_OPTS="$JAVA_OPTS -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager"
+else
+ JAVA_OPTS="$JAVA_OPTS $LOGGING_MANAGER"
+fi
+
+# ----- Execute The Requested Command -----------------------------------------
+
+# Bugzilla 37848: only output this if we have a TTY
+if [ $have_tty -eq 1 ]; then
+ echo "Using CATALINA_BASE: $CATALINA_BASE"
+ echo "Using CATALINA_HOME: $CATALINA_HOME"
+ echo "Using CATALINA_TMPDIR: $CATALINA_TMPDIR"
+ if [ "$1" = "debug" ] ; then
+ echo "Using JAVA_HOME: $JAVA_HOME"
+ else
+ echo "Using JRE_HOME: $JRE_HOME"
+ fi
+ echo "Using CLASSPATH: $CLASSPATH"
+ if [ ! -z "$CATALINA_PID" ]; then
+ echo "Using CATALINA_PID: $CATALINA_PID"
+ fi
+fi
+
+if [ "$1" = "jpda" ] ; then
+ if [ -z "$JPDA_TRANSPORT" ]; then
+ JPDA_TRANSPORT="dt_socket"
+ fi
+ if [ -z "$JPDA_ADDRESS" ]; then
+ JPDA_ADDRESS="8000"
+ fi
+ if [ -z "$JPDA_SUSPEND" ]; then
+ JPDA_SUSPEND="n"
+ fi
+ if [ -z "$JPDA_OPTS" ]; then
+ JPDA_OPTS="-agentlib:jdwp=transport=$JPDA_TRANSPORT,address=$JPDA_ADDRESS,server=y,suspend=$JPDA_SUSPEND"
+ fi
+ CATALINA_OPTS="$CATALINA_OPTS $JPDA_OPTS"
+ shift
+fi
+
+if [ "$1" = "debug" ] ; then
+ if $os400; then
+ echo "Debug command not available on OS400"
+ exit 1
+ else
+ shift
+ if [ "$1" = "-security" ] ; then
+ if [ $have_tty -eq 1 ]; then
+ echo "Using Security Manager"
+ fi
+ shift
+ exec "$_RUNJDB" "$LOGGING_CONFIG" $JAVA_OPTS $CATALINA_OPTS \
+ -Djava.endorsed.dirs="$JAVA_ENDORSED_DIRS" -classpath "$CLASSPATH" \
+ -sourcepath "$CATALINA_HOME"/../../java \
+ -Djava.security.manager \
+ -Djava.security.policy=="$CATALINA_BASE"/conf/catalina.policy \
+ -Dcatalina.base="$CATALINA_BASE" \
+ -Dcatalina.home="$CATALINA_HOME" \
+ -Djava.io.tmpdir="$CATALINA_TMPDIR" \
+ org.apache.catalina.startup.Bootstrap "$@" start
+ else
+ exec "$_RUNJDB" "$LOGGING_CONFIG" $JAVA_OPTS $CATALINA_OPTS \
+ -Djava.endorsed.dirs="$JAVA_ENDORSED_DIRS" -classpath "$CLASSPATH" \
+ -sourcepath "$CATALINA_HOME"/../../java \
+ -Dcatalina.base="$CATALINA_BASE" \
+ -Dcatalina.home="$CATALINA_HOME" \
+ -Djava.io.tmpdir="$CATALINA_TMPDIR" \
+ org.apache.catalina.startup.Bootstrap "$@" start
+ fi
+ fi
+
+elif [ "$1" = "run" ]; then
+
+ shift
+ if [ "$1" = "-security" ] ; then
+ if [ $have_tty -eq 1 ]; then
+ echo "Using Security Manager"
+ fi
+ shift
+ exec "$_RUNJAVA" "$LOGGING_CONFIG" $JAVA_OPTS $CATALINA_OPTS \
+ -Djava.endorsed.dirs="$JAVA_ENDORSED_DIRS" -classpath "$CLASSPATH" \
+ -Djava.security.manager \
+ -Djava.security.policy=="$CATALINA_BASE"/conf/catalina.policy \
+ -Dcatalina.base="$CATALINA_BASE" \
+ -Dcatalina.home="$CATALINA_HOME" \
+ -Djava.io.tmpdir="$CATALINA_TMPDIR" \
+ org.apache.catalina.startup.Bootstrap "$@" start
+ else
+ exec "$_RUNJAVA" "$LOGGING_CONFIG" $JAVA_OPTS $CATALINA_OPTS \
+ -Djava.endorsed.dirs="$JAVA_ENDORSED_DIRS" -classpath "$CLASSPATH" \
+ -Dcatalina.base="$CATALINA_BASE" \
+ -Dcatalina.home="$CATALINA_HOME" \
+ -Djava.io.tmpdir="$CATALINA_TMPDIR" \
+ org.apache.catalina.startup.Bootstrap "$@" start
+ fi
+
+elif [ "$1" = "start" ] ; then
+
+ if [ ! -z "$CATALINA_PID" ]; then
+ if [ -f "$CATALINA_PID" ]; then
+ if [ -s "$CATALINA_PID" ]; then
+ echo "Existing PID file found during start."
+ if [ -r "$CATALINA_PID" ]; then
+ PID=`cat "$CATALINA_PID"`
+ ps -p $PID >/dev/null 2>&1
+ if [ $? -eq 0 ] ; then
+ echo "Tomcat appears to still be running with PID $PID. Start aborted."
+ exit 1
+ else
+ echo "Removing/clearing stale PID file."
+ rm -f "$CATALINA_PID" >/dev/null 2>&1
+ if [ $? != 0 ]; then
+ if [ -w "$CATALINA_PID" ]; then
+ cat /dev/null > "$CATALINA_PID"
+ else
+ echo "Unable to remove or clear stale PID file. Start aborted."
+ exit 1
+ fi
+ fi
+ fi
+ else
+ echo "Unable to read PID file. Start aborted."
+ exit 1
+ fi
+ else
+ rm -f "$CATALINA_PID" >/dev/null 2>&1
+ if [ $? != 0 ]; then
+ if [ ! -w "$CATALINA_PID" ]; then
+ echo "Unable to remove or write to empty PID file. Start aborted."
+ exit 1
+ fi
+ fi
+ fi
+ fi
+ fi
+
+ shift
+ touch "$CATALINA_OUT"
+ if [ "$1" = "-security" ] ; then
+ if [ $have_tty -eq 1 ]; then
+ echo "Using Security Manager"
+ fi
+ shift
+ "$_RUNJAVA" "$LOGGING_CONFIG" $JAVA_OPTS $CATALINA_OPTS \
+ -Djava.endorsed.dirs="$JAVA_ENDORSED_DIRS" -classpath "$CLASSPATH" \
+ -Djava.security.manager \
+ -Djava.security.policy=="$CATALINA_BASE"/conf/catalina.policy \
+ -Dcatalina.base="$CATALINA_BASE" \
+ -Dcatalina.home="$CATALINA_HOME" \
+ -Djava.io.tmpdir="$CATALINA_TMPDIR" \
+ org.apache.catalina.startup.Bootstrap "$@" start \
+ >> "$CATALINA_OUT" 2>&1 &
+
+ else
+ "$_RUNJAVA" "$LOGGING_CONFIG" $JAVA_OPTS $CATALINA_OPTS \
+ -Djava.endorsed.dirs="$JAVA_ENDORSED_DIRS" -classpath "$CLASSPATH" \
+ -Dcatalina.base="$CATALINA_BASE" \
+ -Dcatalina.home="$CATALINA_HOME" \
+ -Djava.io.tmpdir="$CATALINA_TMPDIR" \
+ org.apache.catalina.startup.Bootstrap "$@" start \
+ >> "$CATALINA_OUT" 2>&1 &
+
+ fi
+
+ if [ ! -z "$CATALINA_PID" ]; then
+ echo $! > "$CATALINA_PID"
+ fi
+
+elif [ "$1" = "stop" ] ; then
+
+ shift
+
+ SLEEP=5
+ if [ ! -z "$1" ]; then
+ echo $1 | grep "[^0-9]" >/dev/null 2>&1
+ if [ $? -gt 0 ]; then
+ SLEEP=$1
+ shift
+ fi
+ fi
+
+ FORCE=0
+ if [ "$1" = "-force" ]; then
+ shift
+ FORCE=1
+ fi
+
+ if [ ! -z "$CATALINA_PID" ]; then
+ if [ -f "$CATALINA_PID" ]; then
+ if [ -s "$CATALINA_PID" ]; then
+ kill -0 `cat "$CATALINA_PID"` >/dev/null 2>&1
+ if [ $? -gt 0 ]; then
+ echo "PID file found but no matching process was found. Stop aborted."
+ exit 1
+ fi
+ else
+ echo "PID file is empty and has been ignored."
+ fi
+ else
+ echo "\$CATALINA_PID was set but the specified file does not exist. Is Tomcat running? Stop aborted."
+ exit 1
+ fi
+ fi
+
+ "$_RUNJAVA" $JAVA_OPTS \
+ -Djava.endorsed.dirs="$JAVA_ENDORSED_DIRS" -classpath "$CLASSPATH" \
+ -Dcatalina.base="$CATALINA_BASE" \
+ -Dcatalina.home="$CATALINA_HOME" \
+ -Djava.io.tmpdir="$CATALINA_TMPDIR" \
+ org.apache.catalina.startup.Bootstrap "$@" stop
+
+ if [ ! -z "$CATALINA_PID" ]; then
+ if [ -f "$CATALINA_PID" ]; then
+ while [ $SLEEP -ge 0 ]; do
+ kill -0 `cat "$CATALINA_PID"` >/dev/null 2>&1
+ if [ $? -gt 0 ]; then
+ rm -f "$CATALINA_PID" >/dev/null 2>&1
+ if [ $? != 0 ]; then
+ if [ -w "$CATALINA_PID" ]; then
+ cat /dev/null > "$CATALINA_PID"
+ else
+ echo "Tomcat stopped but the PID file could not be removed or cleared."
+ fi
+ fi
+ break
+ fi
+ if [ $SLEEP -gt 0 ]; then
+ sleep 1
+ fi
+ if [ $SLEEP -eq 0 ]; then
+ if [ $FORCE -eq 0 ]; then
+ echo "Tomcat did not stop in time. PID file was not removed."
+ fi
+ fi
+ SLEEP=`expr $SLEEP - 1 `
+ done
+ fi
+ fi
+
+ if [ $FORCE -eq 1 ]; then
+ if [ -z "$CATALINA_PID" ]; then
+ echo "Kill failed: \$CATALINA_PID not set"
+ else
+ if [ -f "$CATALINA_PID" ]; then
+ PID=`cat "$CATALINA_PID"`
+ echo "Killing Tomcat with the PID: $PID"
+ kill -9 $PID
+ rm -f "$CATALINA_PID" >/dev/null 2>&1
+ if [ $? != 0 ]; then
+ echo "Tomcat was killed but the PID file could not be removed."
+ fi
+ fi
+ fi
+ fi
+
+elif [ "$1" = "version" ] ; then
+
+ "$_RUNJAVA" \
+ -classpath "$CATALINA_HOME/lib/catalina.jar" \
+ org.apache.catalina.util.ServerInfo
+
+else
+
+ echo "Usage: catalina.sh ( commands ... )"
+ echo "commands:"
+ if $os400; then
+ echo " debug Start Catalina in a debugger (not available on OS400)"
+ echo " debug -security Debug Catalina with a security manager (not available on OS400)"
+ else
+ echo " debug Start Catalina in a debugger"
+ echo " debug -security Debug Catalina with a security manager"
+ fi
+ echo " jpda start Start Catalina under JPDA debugger"
+ echo " run Start Catalina in the current window"
+ echo " run -security Start in the current window with security manager"
+ echo " start Start Catalina in a separate window"
+ echo " start -security Start in a separate window with security manager"
+ echo " stop Stop Catalina, waiting up to 5 seconds for the process to end"
+ echo " stop n Stop Catalina, waiting up to n seconds for the process to end"
+ echo " stop -force Stop Catalina, wait up to 5 seconds and then use kill -KILL if still running"
+ echo " stop n -force Stop Catalina, wait up to n seconds and then use kill -KILL if still running"
+ echo " version What version of tomcat are you running?"
+ echo "Note: Waiting for the process to end and use of the -force option require that \$CATALINA_PID is defined"
+ exit 1
+
+fi
diff --git a/x86_64/share/hadoop/httpfs/tomcat/bin/commons-daemon-native.tar.gz b/x86_64/share/hadoop/httpfs/tomcat/bin/commons-daemon-native.tar.gz
new file mode 100644
index 0000000..fcb4da7
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/bin/commons-daemon-native.tar.gz differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/bin/commons-daemon.jar b/x86_64/share/hadoop/httpfs/tomcat/bin/commons-daemon.jar
new file mode 100644
index 0000000..868ae2e
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/bin/commons-daemon.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/bin/cpappend.bat b/x86_64/share/hadoop/httpfs/tomcat/bin/cpappend.bat
new file mode 100644
index 0000000..b9b90c1
--- /dev/null
+++ b/x86_64/share/hadoop/httpfs/tomcat/bin/cpappend.bat
@@ -0,0 +1,35 @@
+@echo off
+rem Licensed to the Apache Software Foundation (ASF) under one or more
+rem contributor license agreements. See the NOTICE file distributed with
+rem this work for additional information regarding copyright ownership.
+rem The ASF licenses this file to You under the Apache License, Version 2.0
+rem (the "License"); you may not use this file except in compliance with
+rem the License. You may obtain a copy of the License at
+rem
+rem http://www.apache.org/licenses/LICENSE-2.0
+rem
+rem Unless required by applicable law or agreed to in writing, software
+rem distributed under the License is distributed on an "AS IS" BASIS,
+rem WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+rem See the License for the specific language governing permissions and
+rem limitations under the License.
+
+rem ---------------------------------------------------------------------------
+rem Append to CLASSPATH
+rem
+rem $Id: cpappend.bat 562770 2007-08-04 22:13:58Z markt $
+rem ---------------------------------------------------------------------------
+
+rem Process the first argument
+if ""%1"" == """" goto end
+set CLASSPATH=%CLASSPATH%;%1
+shift
+
+rem Process the remaining arguments
+:setArgs
+if ""%1"" == """" goto doneSetArgs
+set CLASSPATH=%CLASSPATH% %1
+shift
+goto setArgs
+:doneSetArgs
+:end
diff --git a/x86_64/share/hadoop/httpfs/tomcat/bin/digest.bat b/x86_64/share/hadoop/httpfs/tomcat/bin/digest.bat
new file mode 100644
index 0000000..32ffb3a
--- /dev/null
+++ b/x86_64/share/hadoop/httpfs/tomcat/bin/digest.bat
@@ -0,0 +1,56 @@
+@echo off
+rem Licensed to the Apache Software Foundation (ASF) under one or more
+rem contributor license agreements. See the NOTICE file distributed with
+rem this work for additional information regarding copyright ownership.
+rem The ASF licenses this file to You under the Apache License, Version 2.0
+rem (the "License"); you may not use this file except in compliance with
+rem the License. You may obtain a copy of the License at
+rem
+rem http://www.apache.org/licenses/LICENSE-2.0
+rem
+rem Unless required by applicable law or agreed to in writing, software
+rem distributed under the License is distributed on an "AS IS" BASIS,
+rem WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+rem See the License for the specific language governing permissions and
+rem limitations under the License.
+
+if "%OS%" == "Windows_NT" setlocal
+rem ---------------------------------------------------------------------------
+rem Script to digest password using the algorithm specified
+rem
+rem $Id: digest.bat 908749 2010-02-10 23:26:42Z markt $
+rem ---------------------------------------------------------------------------
+
+rem Guess CATALINA_HOME if not defined
+if not "%CATALINA_HOME%" == "" goto gotHome
+set CATALINA_HOME=.
+if exist "%CATALINA_HOME%\bin\tool-wrapper.bat" goto okHome
+set CATALINA_HOME=..
+:gotHome
+if exist "%CATALINA_HOME%\bin\tool-wrapper.bat" goto okHome
+echo The CATALINA_HOME environment variable is not defined correctly
+echo This environment variable is needed to run this program
+goto end
+:okHome
+
+set "EXECUTABLE=%CATALINA_HOME%\bin\tool-wrapper.bat"
+
+rem Check that target executable exists
+if exist "%EXECUTABLE%" goto okExec
+echo Cannot find "%EXECUTABLE%"
+echo This file is needed to run this program
+goto end
+:okExec
+
+rem Get remaining unshifted command line arguments and save them in the
+set CMD_LINE_ARGS=
+:setArgs
+if ""%1""=="""" goto doneSetArgs
+set CMD_LINE_ARGS=%CMD_LINE_ARGS% %1
+shift
+goto setArgs
+:doneSetArgs
+
+call "%EXECUTABLE%" -server org.apache.catalina.realm.RealmBase %CMD_LINE_ARGS%
+
+:end
diff --git a/x86_64/share/hadoop/httpfs/tomcat/bin/digest.sh b/x86_64/share/hadoop/httpfs/tomcat/bin/digest.sh
new file mode 100755
index 0000000..88fd456
--- /dev/null
+++ b/x86_64/share/hadoop/httpfs/tomcat/bin/digest.sh
@@ -0,0 +1,48 @@
+#!/bin/sh
+
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# -----------------------------------------------------------------------------
+# Script to digest password using the algorithm specified
+#
+# $Id: digest.sh 1130937 2011-06-03 08:27:13Z markt $
+# -----------------------------------------------------------------------------
+
+# resolve links - $0 may be a softlink
+PRG="$0"
+
+while [ -h "$PRG" ] ; do
+ ls=`ls -ld "$PRG"`
+ link=`expr "$ls" : '.*-> \(.*\)$'`
+ if expr "$link" : '/.*' > /dev/null; then
+ PRG="$link"
+ else
+ PRG=`dirname "$PRG"`/"$link"
+ fi
+done
+
+PRGDIR=`dirname "$PRG"`
+EXECUTABLE=tool-wrapper.sh
+
+# Check that target executable exists
+if [ ! -x "$PRGDIR"/"$EXECUTABLE" ]; then
+ echo "Cannot find $PRGDIR/$EXECUTABLE"
+ echo "The file is absent or does not have execute permission"
+ echo "This file is needed to run this program"
+ exit 1
+fi
+
+exec "$PRGDIR"/"$EXECUTABLE" -server org.apache.catalina.realm.RealmBase "$@"
diff --git a/x86_64/share/hadoop/httpfs/tomcat/bin/setclasspath.bat b/x86_64/share/hadoop/httpfs/tomcat/bin/setclasspath.bat
new file mode 100644
index 0000000..2868cea
--- /dev/null
+++ b/x86_64/share/hadoop/httpfs/tomcat/bin/setclasspath.bat
@@ -0,0 +1,82 @@
+@echo off
+rem Licensed to the Apache Software Foundation (ASF) under one or more
+rem contributor license agreements. See the NOTICE file distributed with
+rem this work for additional information regarding copyright ownership.
+rem The ASF licenses this file to You under the Apache License, Version 2.0
+rem (the "License"); you may not use this file except in compliance with
+rem the License. You may obtain a copy of the License at
+rem
+rem http://www.apache.org/licenses/LICENSE-2.0
+rem
+rem Unless required by applicable law or agreed to in writing, software
+rem distributed under the License is distributed on an "AS IS" BASIS,
+rem WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+rem See the License for the specific language governing permissions and
+rem limitations under the License.
+
+rem ---------------------------------------------------------------------------
+rem Set CLASSPATH and Java options
+rem
+rem $Id: setclasspath.bat 908749 2010-02-10 23:26:42Z markt $
+rem ---------------------------------------------------------------------------
+
+rem Make sure prerequisite environment variables are set
+if not "%JAVA_HOME%" == "" goto gotJdkHome
+if not "%JRE_HOME%" == "" goto gotJreHome
+echo Neither the JAVA_HOME nor the JRE_HOME environment variable is defined
+echo At least one of these environment variable is needed to run this program
+goto exit
+
+:gotJreHome
+if not exist "%JRE_HOME%\bin\java.exe" goto noJavaHome
+if not exist "%JRE_HOME%\bin\javaw.exe" goto noJavaHome
+if not ""%1"" == ""debug"" goto okJavaHome
+echo JAVA_HOME should point to a JDK in order to run in debug mode.
+goto exit
+
+:gotJdkHome
+if not exist "%JAVA_HOME%\bin\java.exe" goto noJavaHome
+if not exist "%JAVA_HOME%\bin\javaw.exe" goto noJavaHome
+if not exist "%JAVA_HOME%\bin\jdb.exe" goto noJavaHome
+if not exist "%JAVA_HOME%\bin\javac.exe" goto noJavaHome
+if not "%JRE_HOME%" == "" goto okJavaHome
+set "JRE_HOME=%JAVA_HOME%"
+goto okJavaHome
+
+:noJavaHome
+echo The JAVA_HOME environment variable is not defined correctly
+echo This environment variable is needed to run this program
+echo NB: JAVA_HOME should point to a JDK not a JRE
+goto exit
+:okJavaHome
+
+if not "%BASEDIR%" == "" goto gotBasedir
+echo The BASEDIR environment variable is not defined
+echo This environment variable is needed to run this program
+goto exit
+:gotBasedir
+if exist "%BASEDIR%\bin\setclasspath.bat" goto okBasedir
+echo The BASEDIR environment variable is not defined correctly
+echo This environment variable is needed to run this program
+goto exit
+:okBasedir
+
+rem Don't override the endorsed dir if the user has set it previously
+if not "%JAVA_ENDORSED_DIRS%" == "" goto gotEndorseddir
+rem Set the default -Djava.endorsed.dirs argument
+set "JAVA_ENDORSED_DIRS=%BASEDIR%\endorsed"
+:gotEndorseddir
+
+rem Set standard command for invoking Java.
+rem Note that NT requires a window name argument when using start.
+rem Also note the quoting as JAVA_HOME may contain spaces.
+set _RUNJAVA="%JRE_HOME%\bin\java"
+set _RUNJDB="%JAVA_HOME%\bin\jdb"
+
+goto end
+
+:exit
+exit /b 1
+
+:end
+exit /b 0
diff --git a/x86_64/share/hadoop/httpfs/tomcat/bin/setclasspath.sh b/x86_64/share/hadoop/httpfs/tomcat/bin/setclasspath.sh
new file mode 100755
index 0000000..b608691
--- /dev/null
+++ b/x86_64/share/hadoop/httpfs/tomcat/bin/setclasspath.sh
@@ -0,0 +1,116 @@
+#!/bin/sh
+
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# -----------------------------------------------------------------------------
+# Set CLASSPATH and Java options
+#
+# $Id: setclasspath.sh 795037 2009-07-17 10:52:16Z markt $
+# -----------------------------------------------------------------------------
+
+# Make sure prerequisite environment variables are set
+if [ -z "$JAVA_HOME" -a -z "$JRE_HOME" ]; then
+ # Bugzilla 37284 (reviewed).
+ if $darwin; then
+ if [ -d "/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home" ]; then
+ export JAVA_HOME="/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home"
+ fi
+ else
+ JAVA_PATH=`which java 2>/dev/null`
+ if [ "x$JAVA_PATH" != "x" ]; then
+ JAVA_PATH=`dirname $JAVA_PATH 2>/dev/null`
+ JRE_HOME=`dirname $JAVA_PATH 2>/dev/null`
+ fi
+ if [ "x$JRE_HOME" = "x" ]; then
+ # XXX: Should we try other locations?
+ if [ -x /usr/bin/java ]; then
+ JRE_HOME=/usr
+ fi
+ fi
+ fi
+ if [ -z "$JAVA_HOME" -a -z "$JRE_HOME" ]; then
+ echo "Neither the JAVA_HOME nor the JRE_HOME environment variable is defined"
+ echo "At least one of these environment variable is needed to run this program"
+ exit 1
+ fi
+fi
+if [ -z "$JAVA_HOME" -a "$1" = "debug" ]; then
+ echo "JAVA_HOME should point to a JDK in order to run in debug mode."
+ exit 1
+fi
+if [ -z "$JRE_HOME" ]; then
+ JRE_HOME="$JAVA_HOME"
+fi
+
+# If we're running under jdb, we need a full jdk.
+if [ "$1" = "debug" ] ; then
+ if [ "$os400" = "true" ]; then
+ if [ ! -x "$JAVA_HOME"/bin/java -o ! -x "$JAVA_HOME"/bin/javac ]; then
+ echo "The JAVA_HOME environment variable is not defined correctly"
+ echo "This environment variable is needed to run this program"
+ echo "NB: JAVA_HOME should point to a JDK not a JRE"
+ exit 1
+ fi
+ else
+ if [ ! -x "$JAVA_HOME"/bin/java -o ! -x "$JAVA_HOME"/bin/jdb -o ! -x "$JAVA_HOME"/bin/javac ]; then
+ echo "The JAVA_HOME environment variable is not defined correctly"
+ echo "This environment variable is needed to run this program"
+ echo "NB: JAVA_HOME should point to a JDK not a JRE"
+ exit 1
+ fi
+ fi
+fi
+if [ -z "$BASEDIR" ]; then
+ echo "The BASEDIR environment variable is not defined"
+ echo "This environment variable is needed to run this program"
+ exit 1
+fi
+if [ ! -x "$BASEDIR"/bin/setclasspath.sh ]; then
+ if $os400; then
+ # -x will Only work on the os400 if the files are:
+ # 1. owned by the user
+ # 2. owned by the PRIMARY group of the user
+ # this will not work if the user belongs in secondary groups
+ eval
+ else
+ echo "The BASEDIR environment variable is not defined correctly"
+ echo "This environment variable is needed to run this program"
+ exit 1
+ fi
+fi
+
+# Don't override the endorsed dir if the user has set it previously
+if [ -z "$JAVA_ENDORSED_DIRS" ]; then
+ # Set the default -Djava.endorsed.dirs argument
+ JAVA_ENDORSED_DIRS="$BASEDIR"/endorsed
+fi
+
+# OSX hack to CLASSPATH
+JIKESPATH=
+if [ `uname -s` = "Darwin" ]; then
+ OSXHACK="/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Classes"
+ if [ -d "$OSXHACK" ]; then
+ for i in "$OSXHACK"/*.jar; do
+ JIKESPATH="$JIKESPATH":"$i"
+ done
+ fi
+fi
+
+# Set standard commands for invoking Java.
+_RUNJAVA="$JRE_HOME"/bin/java
+if [ "$os400" != "true" ]; then
+ _RUNJDB="$JAVA_HOME"/bin/jdb
+fi
diff --git a/x86_64/share/hadoop/httpfs/tomcat/bin/shutdown.bat b/x86_64/share/hadoop/httpfs/tomcat/bin/shutdown.bat
new file mode 100644
index 0000000..c1843a3
--- /dev/null
+++ b/x86_64/share/hadoop/httpfs/tomcat/bin/shutdown.bat
@@ -0,0 +1,59 @@
+@echo off
+rem Licensed to the Apache Software Foundation (ASF) under one or more
+rem contributor license agreements. See the NOTICE file distributed with
+rem this work for additional information regarding copyright ownership.
+rem The ASF licenses this file to You under the Apache License, Version 2.0
+rem (the "License"); you may not use this file except in compliance with
+rem the License. You may obtain a copy of the License at
+rem
+rem http://www.apache.org/licenses/LICENSE-2.0
+rem
+rem Unless required by applicable law or agreed to in writing, software
+rem distributed under the License is distributed on an "AS IS" BASIS,
+rem WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+rem See the License for the specific language governing permissions and
+rem limitations under the License.
+
+if "%OS%" == "Windows_NT" setlocal
+rem ---------------------------------------------------------------------------
+rem Stop script for the CATALINA Server
+rem
+rem $Id: shutdown.bat 908749 2010-02-10 23:26:42Z markt $
+rem ---------------------------------------------------------------------------
+
+rem Guess CATALINA_HOME if not defined
+set "CURRENT_DIR=%cd%"
+if not "%CATALINA_HOME%" == "" goto gotHome
+set "CATALINA_HOME=%CURRENT_DIR%"
+if exist "%CATALINA_HOME%\bin\catalina.bat" goto okHome
+cd ..
+set "CATALINA_HOME=%cd%"
+cd "%CURRENT_DIR%"
+:gotHome
+if exist "%CATALINA_HOME%\bin\catalina.bat" goto okHome
+echo The CATALINA_HOME environment variable is not defined correctly
+echo This environment variable is needed to run this program
+goto end
+:okHome
+
+set "EXECUTABLE=%CATALINA_HOME%\bin\catalina.bat"
+
+rem Check that target executable exists
+if exist "%EXECUTABLE%" goto okExec
+echo Cannot find "%EXECUTABLE%"
+echo This file is needed to run this program
+goto end
+:okExec
+
+rem Get remaining unshifted command line arguments and save them in the
+set CMD_LINE_ARGS=
+:setArgs
+if ""%1""=="""" goto doneSetArgs
+set CMD_LINE_ARGS=%CMD_LINE_ARGS% %1
+shift
+goto setArgs
+:doneSetArgs
+
+call "%EXECUTABLE%" stop %CMD_LINE_ARGS%
+
+:end
diff --git a/x86_64/share/hadoop/httpfs/tomcat/bin/shutdown.sh b/x86_64/share/hadoop/httpfs/tomcat/bin/shutdown.sh
new file mode 100755
index 0000000..92a8a28
--- /dev/null
+++ b/x86_64/share/hadoop/httpfs/tomcat/bin/shutdown.sh
@@ -0,0 +1,48 @@
+#!/bin/sh
+
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# -----------------------------------------------------------------------------
+# Stop script for the CATALINA Server
+#
+# $Id: shutdown.sh 1130937 2011-06-03 08:27:13Z markt $
+# -----------------------------------------------------------------------------
+
+# resolve links - $0 may be a softlink
+PRG="$0"
+
+while [ -h "$PRG" ] ; do
+ ls=`ls -ld "$PRG"`
+ link=`expr "$ls" : '.*-> \(.*\)$'`
+ if expr "$link" : '/.*' > /dev/null; then
+ PRG="$link"
+ else
+ PRG=`dirname "$PRG"`/"$link"
+ fi
+done
+
+PRGDIR=`dirname "$PRG"`
+EXECUTABLE=catalina.sh
+
+# Check that target executable exists
+if [ ! -x "$PRGDIR"/"$EXECUTABLE" ]; then
+ echo "Cannot find $PRGDIR/$EXECUTABLE"
+ echo "The file is absent or does not have execute permission"
+ echo "This file is needed to run this program"
+ exit 1
+fi
+
+exec "$PRGDIR"/"$EXECUTABLE" stop "$@"
diff --git a/x86_64/share/hadoop/httpfs/tomcat/bin/startup.bat b/x86_64/share/hadoop/httpfs/tomcat/bin/startup.bat
new file mode 100644
index 0000000..6f9a9b8
--- /dev/null
+++ b/x86_64/share/hadoop/httpfs/tomcat/bin/startup.bat
@@ -0,0 +1,59 @@
+@echo off
+rem Licensed to the Apache Software Foundation (ASF) under one or more
+rem contributor license agreements. See the NOTICE file distributed with
+rem this work for additional information regarding copyright ownership.
+rem The ASF licenses this file to You under the Apache License, Version 2.0
+rem (the "License"); you may not use this file except in compliance with
+rem the License. You may obtain a copy of the License at
+rem
+rem http://www.apache.org/licenses/LICENSE-2.0
+rem
+rem Unless required by applicable law or agreed to in writing, software
+rem distributed under the License is distributed on an "AS IS" BASIS,
+rem WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+rem See the License for the specific language governing permissions and
+rem limitations under the License.
+
+if "%OS%" == "Windows_NT" setlocal
+rem ---------------------------------------------------------------------------
+rem Start script for the CATALINA Server
+rem
+rem $Id: startup.bat 908749 2010-02-10 23:26:42Z markt $
+rem ---------------------------------------------------------------------------
+
+rem Guess CATALINA_HOME if not defined
+set "CURRENT_DIR=%cd%"
+if not "%CATALINA_HOME%" == "" goto gotHome
+set "CATALINA_HOME=%CURRENT_DIR%"
+if exist "%CATALINA_HOME%\bin\catalina.bat" goto okHome
+cd ..
+set "CATALINA_HOME=%cd%"
+cd "%CURRENT_DIR%"
+:gotHome
+if exist "%CATALINA_HOME%\bin\catalina.bat" goto okHome
+echo The CATALINA_HOME environment variable is not defined correctly
+echo This environment variable is needed to run this program
+goto end
+:okHome
+
+set "EXECUTABLE=%CATALINA_HOME%\bin\catalina.bat"
+
+rem Check that target executable exists
+if exist "%EXECUTABLE%" goto okExec
+echo Cannot find "%EXECUTABLE%"
+echo This file is needed to run this program
+goto end
+:okExec
+
+rem Get remaining unshifted command line arguments and save them in the
+set CMD_LINE_ARGS=
+:setArgs
+if ""%1""=="""" goto doneSetArgs
+set CMD_LINE_ARGS=%CMD_LINE_ARGS% %1
+shift
+goto setArgs
+:doneSetArgs
+
+call "%EXECUTABLE%" start %CMD_LINE_ARGS%
+
+:end
diff --git a/x86_64/share/hadoop/httpfs/tomcat/bin/startup.sh b/x86_64/share/hadoop/httpfs/tomcat/bin/startup.sh
new file mode 100755
index 0000000..4702636
--- /dev/null
+++ b/x86_64/share/hadoop/httpfs/tomcat/bin/startup.sh
@@ -0,0 +1,65 @@
+#!/bin/sh
+
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# -----------------------------------------------------------------------------
+# Start Script for the CATALINA Server
+#
+# $Id: startup.sh 1130937 2011-06-03 08:27:13Z markt $
+# -----------------------------------------------------------------------------
+
+# Better OS/400 detection: see Bugzilla 31132
+os400=false
+darwin=false
+case "`uname`" in
+CYGWIN*) cygwin=true;;
+OS400*) os400=true;;
+Darwin*) darwin=true;;
+esac
+
+# resolve links - $0 may be a softlink
+PRG="$0"
+
+while [ -h "$PRG" ] ; do
+ ls=`ls -ld "$PRG"`
+ link=`expr "$ls" : '.*-> \(.*\)$'`
+ if expr "$link" : '/.*' > /dev/null; then
+ PRG="$link"
+ else
+ PRG=`dirname "$PRG"`/"$link"
+ fi
+done
+
+PRGDIR=`dirname "$PRG"`
+EXECUTABLE=catalina.sh
+
+# Check that target executable exists
+if $os400; then
+ # -x will Only work on the os400 if the files are:
+ # 1. owned by the user
+ # 2. owned by the PRIMARY group of the user
+ # this will not work if the user belongs in secondary groups
+ eval
+else
+ if [ ! -x "$PRGDIR"/"$EXECUTABLE" ]; then
+ echo "Cannot find $PRGDIR/$EXECUTABLE"
+ echo "The file is absent or does not have execute permission"
+ echo "This file is needed to run this program"
+ exit 1
+ fi
+fi
+
+exec "$PRGDIR"/"$EXECUTABLE" start "$@"
diff --git a/x86_64/share/hadoop/httpfs/tomcat/bin/tomcat-juli.jar b/x86_64/share/hadoop/httpfs/tomcat/bin/tomcat-juli.jar
new file mode 100644
index 0000000..2fcbafd
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/bin/tomcat-juli.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/bin/tomcat-native.tar.gz b/x86_64/share/hadoop/httpfs/tomcat/bin/tomcat-native.tar.gz
new file mode 100644
index 0000000..c6a9452
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/bin/tomcat-native.tar.gz differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/bin/tool-wrapper.bat b/x86_64/share/hadoop/httpfs/tomcat/bin/tool-wrapper.bat
new file mode 100644
index 0000000..e1955b8
--- /dev/null
+++ b/x86_64/share/hadoop/httpfs/tomcat/bin/tool-wrapper.bat
@@ -0,0 +1,85 @@
+@echo off
+rem Licensed to the Apache Software Foundation (ASF) under one or more
+rem contributor license agreements. See the NOTICE file distributed with
+rem this work for additional information regarding copyright ownership.
+rem The ASF licenses this file to You under the Apache License, Version 2.0
+rem (the "License"); you may not use this file except in compliance with
+rem the License. You may obtain a copy of the License at
+rem
+rem http://www.apache.org/licenses/LICENSE-2.0
+rem
+rem Unless required by applicable law or agreed to in writing, software
+rem distributed under the License is distributed on an "AS IS" BASIS,
+rem WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+rem See the License for the specific language governing permissions and
+rem limitations under the License.
+
+if "%OS%" == "Windows_NT" setlocal
+rem ---------------------------------------------------------------------------
+rem Wrapper script for command line tools
+rem
+rem Environment Variable Prerequisites
+rem
+rem CATALINA_HOME May point at your Catalina "build" directory.
+rem
+rem TOOL_OPTS (Optional) Java runtime options used when the "start",
+rem "stop", or "run" command is executed.
+rem
+rem JAVA_HOME Must point at your Java Development Kit installation.
+rem
+rem JAVA_OPTS (Optional) Java runtime options used when the "start",
+rem "stop", or "run" command is executed.
+rem
+rem $Id: tool-wrapper.bat 1040555 2010-11-30 15:00:25Z rjung $
+rem ---------------------------------------------------------------------------
+
+rem Guess CATALINA_HOME if not defined
+if not "%CATALINA_HOME%" == "" goto gotHome
+set CATALINA_HOME=.
+if exist "%CATALINA_HOME%\bin\tool-wrapper.bat" goto okHome
+set CATALINA_HOME=..
+:gotHome
+if exist "%CATALINA_HOME%\bin\tool-wrapper.bat" goto okHome
+echo The CATALINA_HOME environment variable is not defined correctly
+echo This environment variable is needed to run this program
+goto end
+:okHome
+
+rem Ensure that any user defined CLASSPATH variables are not used on startup,
+rem but allow them to be specified in setenv.bat, in rare case when it is needed.
+set CLASSPATH=
+
+rem Get standard environment variables
+if exist "%CATALINA_HOME%\bin\setenv.bat" call "%CATALINA_HOME%\bin\setenv.bat"
+
+rem Get standard Java environment variables
+if exist "%CATALINA_HOME%\bin\setclasspath.bat" goto okSetclasspath
+echo Cannot find "%CATALINA_HOME%\bin\setclasspath.bat"
+echo This file is needed to run this program
+goto end
+:okSetclasspath
+set "BASEDIR=%CATALINA_HOME%"
+call "%CATALINA_HOME%\bin\setclasspath.bat"
+
+rem Add on extra jar files to CLASSPATH
+rem Note that there are no quotes as we do not want to introduce random
+rem quotes into the CLASSPATH
+if "%CLASSPATH%" == "" goto noclasspath
+set "CLASSPATH=%CLASSPATH%;%CATALINA_HOME%\bin\bootstrap.jar;%BASEDIR%\lib\servlet-api.jar"
+goto okclasspath
+:noclasspath
+set "CLASSPATH=%CATALINA_HOME%\bin\bootstrap.jar;%BASEDIR%\lib\servlet-api.jar"
+:okclasspath
+
+rem Get remaining unshifted command line arguments and save them in the
+set CMD_LINE_ARGS=
+:setArgs
+if ""%1""=="""" goto doneSetArgs
+set CMD_LINE_ARGS=%CMD_LINE_ARGS% %1
+shift
+goto setArgs
+:doneSetArgs
+
+%_RUNJAVA% %JAVA_OPTS% %TOOL_OPTS% -Djava.endorsed.dirs="%JAVA_ENDORSED_DIRS%" -classpath "%CLASSPATH%" -Dcatalina.home="%CATALINA_HOME%" org.apache.catalina.startup.Tool %CMD_LINE_ARGS%
+
+:end
diff --git a/x86_64/share/hadoop/httpfs/tomcat/bin/tool-wrapper.sh b/x86_64/share/hadoop/httpfs/tomcat/bin/tool-wrapper.sh
new file mode 100755
index 0000000..11ca986
--- /dev/null
+++ b/x86_64/share/hadoop/httpfs/tomcat/bin/tool-wrapper.sh
@@ -0,0 +1,99 @@
+#!/bin/sh
+
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# -----------------------------------------------------------------------------
+# Wrapper script for command line tools
+#
+# Environment Variable Prerequisites
+#
+# CATALINA_HOME May point at your Catalina "build" directory.
+#
+# TOOL_OPTS (Optional) Java runtime options used when the "start",
+# "stop", or "run" command is executed.
+#
+# JAVA_HOME Must point at your Java Development Kit installation.
+#
+# JAVA_OPTS (Optional) Java runtime options used when the "start",
+# "stop", or "run" command is executed.
+#
+# $Id: tool-wrapper.sh 1040555 2010-11-30 15:00:25Z rjung $
+# -----------------------------------------------------------------------------
+
+# OS specific support. $var _must_ be set to either true or false.
+cygwin=false
+case "`uname`" in
+CYGWIN*) cygwin=true;;
+esac
+
+# resolve links - $0 may be a softlink
+PRG="$0"
+
+while [ -h "$PRG" ]; do
+ ls=`ls -ld "$PRG"`
+ link=`expr "$ls" : '.*-> \(.*\)$'`
+ if expr "$link" : '/.*' > /dev/null; then
+ PRG="$link"
+ else
+ PRG=`dirname "$PRG"`/"$link"
+ fi
+done
+
+# Get standard environment variables
+PRGDIR=`dirname "$PRG"`
+CATALINA_HOME=`cd "$PRGDIR/.." >/dev/null; pwd`
+
+# Ensure that any user defined CLASSPATH variables are not used on startup,
+# but allow them to be specified in setenv.sh, in rare case when it is needed.
+CLASSPATH=
+
+if [ -r "$CATALINA_HOME"/bin/setenv.sh ]; then
+ . "$CATALINA_HOME"/bin/setenv.sh
+fi
+
+# For Cygwin, ensure paths are in UNIX format before anything is touched
+if $cygwin; then
+ [ -n "$JAVA_HOME" ] && JAVA_HOME=`cygpath --unix "$JAVA_HOME"`
+ [ -n "$CATALINA_HOME" ] && CATALINA_HOME=`cygpath --unix "$CATALINA_HOME"`
+ [ -n "$CLASSPATH" ] && CLASSPATH=`cygpath --path --unix "$CLASSPATH"`
+fi
+
+# Get standard Java environment variables
+if [ -r "$CATALINA_HOME"/bin/setclasspath.sh ]; then
+ BASEDIR="$CATALINA_HOME"
+ . "$CATALINA_HOME"/bin/setclasspath.sh
+else
+ echo "Cannot find $CATALINA_HOME/bin/setclasspath.sh"
+ echo "This file is needed to run this program"
+ exit 1
+fi
+
+# Add on extra jar files to CLASSPATH
+CLASSPATH="$CLASSPATH":"$CATALINA_HOME"/bin/bootstrap.jar:"$BASEDIR"/lib/servlet-api.jar
+
+# For Cygwin, switch paths to Windows format before running java
+if $cygwin; then
+ JAVA_HOME=`cygpath --path --windows "$JAVA_HOME"`
+ CATALINA_HOME=`cygpath --path --windows "$CATALINA_HOME"`
+ CLASSPATH=`cygpath --path --windows "$CLASSPATH"`
+fi
+
+# ----- Execute The Requested Command -----------------------------------------
+
+exec "$_RUNJAVA" $JAVA_OPTS $TOOL_OPTS \
+ -Djava.endorsed.dirs="$JAVA_ENDORSED_DIRS" -classpath "$CLASSPATH" \
+ -Dcatalina.home="$CATALINA_HOME" \
+ org.apache.catalina.startup.Tool "$@"
diff --git a/x86_64/share/hadoop/httpfs/tomcat/bin/version.bat b/x86_64/share/hadoop/httpfs/tomcat/bin/version.bat
new file mode 100644
index 0000000..8197fd6
--- /dev/null
+++ b/x86_64/share/hadoop/httpfs/tomcat/bin/version.bat
@@ -0,0 +1,59 @@
+@echo off
+rem Licensed to the Apache Software Foundation (ASF) under one or more
+rem contributor license agreements. See the NOTICE file distributed with
+rem this work for additional information regarding copyright ownership.
+rem The ASF licenses this file to You under the Apache License, Version 2.0
+rem (the "License"); you may not use this file except in compliance with
+rem the License. You may obtain a copy of the License at
+rem
+rem http://www.apache.org/licenses/LICENSE-2.0
+rem
+rem Unless required by applicable law or agreed to in writing, software
+rem distributed under the License is distributed on an "AS IS" BASIS,
+rem WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+rem See the License for the specific language governing permissions and
+rem limitations under the License.
+
+if "%OS%" == "Windows_NT" setlocal
+rem ---------------------------------------------------------------------------
+rem Version script for the CATALINA Server
+rem
+rem $Id: version.bat 908749 2010-02-10 23:26:42Z markt $
+rem ---------------------------------------------------------------------------
+
+rem Guess CATALINA_HOME if not defined
+set "CURRENT_DIR=%cd%"
+if not "%CATALINA_HOME%" == "" goto gotHome
+set "CATALINA_HOME=%CURRENT_DIR%"
+if exist "%CATALINA_HOME%\bin\catalina.bat" goto okHome
+cd ..
+set "CATALINA_HOME=%cd%"
+cd "%CURRENT_DIR%"
+:gotHome
+if exist "%CATALINA_HOME%\bin\catalina.bat" goto okHome
+echo The CATALINA_HOME environment variable is not defined correctly
+echo This environment variable is needed to run this program
+goto end
+:okHome
+
+set "EXECUTABLE=%CATALINA_HOME%\bin\catalina.bat"
+
+rem Check that target executable exists
+if exist "%EXECUTABLE%" goto okExec
+echo Cannot find "%EXECUTABLE%"
+echo This file is needed to run this program
+goto end
+:okExec
+
+rem Get remaining unshifted command line arguments and save them in the
+set CMD_LINE_ARGS=
+:setArgs
+if ""%1""=="""" goto doneSetArgs
+set CMD_LINE_ARGS=%CMD_LINE_ARGS% %1
+shift
+goto setArgs
+:doneSetArgs
+
+call "%EXECUTABLE%" version %CMD_LINE_ARGS%
+
+:end
diff --git a/x86_64/share/hadoop/httpfs/tomcat/bin/version.sh b/x86_64/share/hadoop/httpfs/tomcat/bin/version.sh
new file mode 100755
index 0000000..1b45474
--- /dev/null
+++ b/x86_64/share/hadoop/httpfs/tomcat/bin/version.sh
@@ -0,0 +1,48 @@
+#!/bin/sh
+
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# -----------------------------------------------------------------------------
+# Version Script for the CATALINA Server
+#
+# $Id: version.sh 1130937 2011-06-03 08:27:13Z markt $
+# -----------------------------------------------------------------------------
+
+# resolve links - $0 may be a softlink
+PRG="$0"
+
+while [ -h "$PRG" ] ; do
+ ls=`ls -ld "$PRG"`
+ link=`expr "$ls" : '.*-> \(.*\)$'`
+ if expr "$link" : '/.*' > /dev/null; then
+ PRG="$link"
+ else
+ PRG=`dirname "$PRG"`/"$link"
+ fi
+done
+
+PRGDIR=`dirname "$PRG"`
+EXECUTABLE=catalina.sh
+
+# Check that target executable exists
+if [ ! -x "$PRGDIR"/"$EXECUTABLE" ]; then
+ echo "Cannot find $PRGDIR/$EXECUTABLE"
+ echo "The file is absent or does not have execute permission"
+ echo "This file is needed to run this program"
+ exit 1
+fi
+
+exec "$PRGDIR"/"$EXECUTABLE" version "$@"
diff --git a/x86_64/share/hadoop/httpfs/tomcat/conf/catalina.policy b/x86_64/share/hadoop/httpfs/tomcat/conf/catalina.policy
new file mode 100644
index 0000000..e58d154
--- /dev/null
+++ b/x86_64/share/hadoop/httpfs/tomcat/conf/catalina.policy
@@ -0,0 +1,222 @@
+// Licensed to the Apache Software Foundation (ASF) under one or more
+// contributor license agreements. See the NOTICE file distributed with
+// this work for additional information regarding copyright ownership.
+// The ASF licenses this file to You under the Apache License, Version 2.0
+// (the "License"); you may not use this file except in compliance with
+// the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// ============================================================================
+// catalina.policy - Security Policy Permissions for Tomcat 6
+//
+// This file contains a default set of security policies to be enforced (by the
+// JVM) when Catalina is executed with the "-security" option. In addition
+// to the permissions granted here, the following additional permissions are
+// granted to the codebase specific to each web application:
+//
+// * Read access to its document root directory
+// * Read, write and delete access to its working directory
+//
+// $Id: catalina.policy 1135491 2011-06-14 11:27:38Z markt $
+// ============================================================================
+
+
+// ========== SYSTEM CODE PERMISSIONS =========================================
+
+
+// These permissions apply to javac
+grant codeBase "file:${java.home}/lib/-" {
+ permission java.security.AllPermission;
+};
+
+// These permissions apply to all shared system extensions
+grant codeBase "file:${java.home}/jre/lib/ext/-" {
+ permission java.security.AllPermission;
+};
+
+// These permissions apply to javac when ${java.home] points at $JAVA_HOME/jre
+grant codeBase "file:${java.home}/../lib/-" {
+ permission java.security.AllPermission;
+};
+
+// These permissions apply to all shared system extensions when
+// ${java.home} points at $JAVA_HOME/jre
+grant codeBase "file:${java.home}/lib/ext/-" {
+ permission java.security.AllPermission;
+};
+
+
+// ========== CATALINA CODE PERMISSIONS =======================================
+
+
+// These permissions apply to the daemon code
+grant codeBase "file:${catalina.home}/bin/commons-daemon.jar" {
+ permission java.security.AllPermission;
+};
+
+// These permissions apply to the logging API
+// Note: If tomcat-juli.jar is in ${catalina.base} and not in ${catalina.home},
+// update this section accordingly.
+// grant codeBase "file:${catalina.base}/bin/tomcat-juli.jar" {..}
+grant codeBase "file:${catalina.home}/bin/tomcat-juli.jar" {
+ permission java.io.FilePermission
+ "${java.home}${file.separator}lib${file.separator}logging.properties", "read";
+
+ permission java.io.FilePermission
+ "${catalina.base}${file.separator}conf${file.separator}logging.properties", "read";
+ permission java.io.FilePermission
+ "${catalina.base}${file.separator}logs", "read, write";
+ permission java.io.FilePermission
+ "${catalina.base}${file.separator}logs${file.separator}*", "read, write";
+
+ permission java.lang.RuntimePermission "shutdownHooks";
+ permission java.lang.RuntimePermission "getClassLoader";
+ permission java.lang.RuntimePermission "setContextClassLoader";
+
+ permission java.util.logging.LoggingPermission "control";
+
+ permission java.util.PropertyPermission "java.util.logging.config.class", "read";
+ permission java.util.PropertyPermission "java.util.logging.config.file", "read";
+ permission java.util.PropertyPermission "catalina.base", "read";
+
+ // Note: To enable per context logging configuration, permit read access to
+ // the appropriate file. Be sure that the logging configuration is
+ // secure before enabling such access.
+ // E.g. for the examples web application (uncomment and unwrap
+ // the following to be on a single line):
+ // permission java.io.FilePermission "${catalina.base}${file.separator}
+ // webapps${file.separator}examples${file.separator}WEB-INF
+ // ${file.separator}classes${file.separator}logging.properties", "read";
+};
+
+// These permissions apply to the server startup code
+grant codeBase "file:${catalina.home}/bin/bootstrap.jar" {
+ permission java.security.AllPermission;
+};
+
+// These permissions apply to the servlet API classes
+// and those that are shared across all class loaders
+// located in the "lib" directory
+grant codeBase "file:${catalina.home}/lib/-" {
+ permission java.security.AllPermission;
+};
+
+
+// If using a per instance lib directory, i.e. ${catalina.base}/lib,
+// then the following permission will need to be uncommented
+// grant codeBase "file:${catalina.base}/lib/-" {
+// permission java.security.AllPermission;
+// };
+
+
+// ========== WEB APPLICATION PERMISSIONS =====================================
+
+
+// These permissions are granted by default to all web applications
+// In addition, a web application will be given a read FilePermission
+// and JndiPermission for all files and directories in its document root.
+grant {
+ // Required for JNDI lookup of named JDBC DataSource's and
+ // javamail named MimePart DataSource used to send mail
+ permission java.util.PropertyPermission "java.home", "read";
+ permission java.util.PropertyPermission "java.naming.*", "read";
+ permission java.util.PropertyPermission "javax.sql.*", "read";
+
+ // OS Specific properties to allow read access
+ permission java.util.PropertyPermission "os.name", "read";
+ permission java.util.PropertyPermission "os.version", "read";
+ permission java.util.PropertyPermission "os.arch", "read";
+ permission java.util.PropertyPermission "file.separator", "read";
+ permission java.util.PropertyPermission "path.separator", "read";
+ permission java.util.PropertyPermission "line.separator", "read";
+
+ // JVM properties to allow read access
+ permission java.util.PropertyPermission "java.version", "read";
+ permission java.util.PropertyPermission "java.vendor", "read";
+ permission java.util.PropertyPermission "java.vendor.url", "read";
+ permission java.util.PropertyPermission "java.class.version", "read";
+ permission java.util.PropertyPermission "java.specification.version", "read";
+ permission java.util.PropertyPermission "java.specification.vendor", "read";
+ permission java.util.PropertyPermission "java.specification.name", "read";
+
+ permission java.util.PropertyPermission "java.vm.specification.version", "read";
+ permission java.util.PropertyPermission "java.vm.specification.vendor", "read";
+ permission java.util.PropertyPermission "java.vm.specification.name", "read";
+ permission java.util.PropertyPermission "java.vm.version", "read";
+ permission java.util.PropertyPermission "java.vm.vendor", "read";
+ permission java.util.PropertyPermission "java.vm.name", "read";
+
+ // Required for OpenJMX
+ permission java.lang.RuntimePermission "getAttribute";
+
+ // Allow read of JAXP compliant XML parser debug
+ permission java.util.PropertyPermission "jaxp.debug", "read";
+
+ // Precompiled JSPs need access to these packages.
+ permission java.lang.RuntimePermission "accessClassInPackage.org.apache.jasper.el";
+ permission java.lang.RuntimePermission "accessClassInPackage.org.apache.jasper.runtime";
+ permission java.lang.RuntimePermission "accessClassInPackage.org.apache.jasper.runtime.*";
+
+ // Precompiled JSPs need access to these system properties.
+ permission java.util.PropertyPermission
+ "org.apache.jasper.runtime.BodyContentImpl.LIMIT_BUFFER", "read";
+ permission java.util.PropertyPermission "org.apache.el.parser.COERCE_TO_ZERO", "read";
+};
+
+
+// The Manager application needs access to the following packages to support the
+// session display functionality. These settings support the following
+// configurations:
+// - default CATALINA_HOME == CATALINA_BASE
+// - CATALINA_HOME != CATALINA_BASE, per instance Manager in CATALINA_BASE
+// - CATALINA_HOME != CATALINA_BASE, shared Manager in CATALINA_HOME
+grant codeBase "file:${catalina.base}/webapps/manager/-" {
+ permission java.lang.RuntimePermission "accessClassInPackage.org.apache.catalina";
+ permission java.lang.RuntimePermission "accessClassInPackage.org.apache.catalina.manager";
+ permission java.lang.RuntimePermission "accessClassInPackage.org.apache.catalina.manager.util";
+};
+grant codeBase "file:${catalina.home}/webapps/manager/-" {
+ permission java.lang.RuntimePermission "accessClassInPackage.org.apache.catalina";
+ permission java.lang.RuntimePermission "accessClassInPackage.org.apache.catalina.manager";
+ permission java.lang.RuntimePermission "accessClassInPackage.org.apache.catalina.manager.util";
+};
+
+// You can assign additional permissions to particular web applications by
+// adding additional "grant" entries here, based on the code base for that
+// application, /WEB-INF/classes/, or /WEB-INF/lib/ jar files.
+//
+// Different permissions can be granted to JSP pages, classes loaded from
+// the /WEB-INF/classes/ directory, all jar files in the /WEB-INF/lib/
+// directory, or even to individual jar files in the /WEB-INF/lib/ directory.
+//
+// For instance, assume that the standard "examples" application
+// included a JDBC driver that needed to establish a network connection to the
+// corresponding database and used the scrape taglib to get the weather from
+// the NOAA web server. You might create a "grant" entries like this:
+//
+// The permissions granted to the context root directory apply to JSP pages.
+// grant codeBase "file:${catalina.base}/webapps/examples/-" {
+// permission java.net.SocketPermission "dbhost.mycompany.com:5432", "connect";
+// permission java.net.SocketPermission "*.noaa.gov:80", "connect";
+// };
+//
+// The permissions granted to the context WEB-INF/classes directory
+// grant codeBase "file:${catalina.base}/webapps/examples/WEB-INF/classes/-" {
+// };
+//
+// The permission granted to your JDBC driver
+// grant codeBase "jar:file:${catalina.base}/webapps/examples/WEB-INF/lib/driver.jar!/-" {
+// permission java.net.SocketPermission "dbhost.mycompany.com:5432", "connect";
+// };
+// The permission granted to the scrape taglib
+// grant codeBase "jar:file:${catalina.base}/webapps/examples/WEB-INF/lib/scrape.jar!/-" {
+// permission java.net.SocketPermission "*.noaa.gov:80", "connect";
+// };
+
diff --git a/x86_64/share/hadoop/httpfs/tomcat/conf/catalina.properties b/x86_64/share/hadoop/httpfs/tomcat/conf/catalina.properties
new file mode 100644
index 0000000..dc2db35
--- /dev/null
+++ b/x86_64/share/hadoop/httpfs/tomcat/conf/catalina.properties
@@ -0,0 +1,81 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+#
+# List of comma-separated packages that start with or equal this string
+# will cause a security exception to be thrown when
+# passed to checkPackageAccess unless the
+# corresponding RuntimePermission ("accessClassInPackage."+package) has
+# been granted.
+package.access=sun.,org.apache.catalina.,org.apache.coyote.,org.apache.tomcat.,org.apache.jasper.,sun.beans.
+#
+# List of comma-separated packages that start with or equal this string
+# will cause a security exception to be thrown when
+# passed to checkPackageDefinition unless the
+# corresponding RuntimePermission ("defineClassInPackage."+package) has
+# been granted.
+#
+# by default, no packages are restricted for definition, and none of
+# the class loaders supplied with the JDK call checkPackageDefinition.
+#
+package.definition=sun.,java.,org.apache.catalina.,org.apache.coyote.,org.apache.tomcat.,org.apache.jasper.
+
+#
+#
+# List of comma-separated paths defining the contents of the "common"
+# classloader. Prefixes should be used to define what is the repository type.
+# Path may be relative to the CATALINA_HOME or CATALINA_BASE path or absolute.
+# If left as blank,the JVM system loader will be used as Catalina's "common"
+# loader.
+# Examples:
+# "foo": Add this folder as a class repository
+# "foo/*.jar": Add all the JARs of the specified folder as class
+# repositories
+# "foo/bar.jar": Add bar.jar as a class repository
+common.loader=${catalina.base}/lib,${catalina.base}/lib/*.jar,${catalina.home}/lib,${catalina.home}/lib/*.jar
+
+#
+# List of comma-separated paths defining the contents of the "server"
+# classloader. Prefixes should be used to define what is the repository type.
+# Path may be relative to the CATALINA_HOME or CATALINA_BASE path or absolute.
+# If left as blank, the "common" loader will be used as Catalina's "server"
+# loader.
+# Examples:
+# "foo": Add this folder as a class repository
+# "foo/*.jar": Add all the JARs of the specified folder as class
+# repositories
+# "foo/bar.jar": Add bar.jar as a class repository
+server.loader=
+
+#
+# List of comma-separated paths defining the contents of the "shared"
+# classloader. Prefixes should be used to define what is the repository type.
+# Path may be relative to the CATALINA_BASE path or absolute. If left as blank,
+# the "common" loader will be used as Catalina's "shared" loader.
+# Examples:
+# "foo": Add this folder as a class repository
+# "foo/*.jar": Add all the JARs of the specified folder as class
+# repositories
+# "foo/bar.jar": Add bar.jar as a class repository
+# Please note that for single jars, e.g. bar.jar, you need the URL form
+# starting with file:.
+shared.loader=
+
+#
+# String cache configuration.
+tomcat.util.buf.StringCache.byte.enabled=true
+#tomcat.util.buf.StringCache.char.enabled=true
+#tomcat.util.buf.StringCache.trainThreshold=500000
+#tomcat.util.buf.StringCache.cacheSize=5000
diff --git a/x86_64/share/hadoop/httpfs/tomcat/conf/context.xml b/x86_64/share/hadoop/httpfs/tomcat/conf/context.xml
new file mode 100644
index 0000000..90bf554
--- /dev/null
+++ b/x86_64/share/hadoop/httpfs/tomcat/conf/context.xml
@@ -0,0 +1,35 @@
+
+
+
+
+
+
+ WEB-INF/web.xml
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/x86_64/share/hadoop/httpfs/tomcat/conf/logging.properties b/x86_64/share/hadoop/httpfs/tomcat/conf/logging.properties
new file mode 100644
index 0000000..294ef74
--- /dev/null
+++ b/x86_64/share/hadoop/httpfs/tomcat/conf/logging.properties
@@ -0,0 +1,67 @@
+#
+# All Rights Reserved.
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+handlers = 1catalina.org.apache.juli.FileHandler, 2localhost.org.apache.juli.FileHandler, 3manager.org.apache.juli.FileHandler, 4host-manager.org.apache.juli.FileHandler, java.util.logging.ConsoleHandler
+
+.handlers = 1catalina.org.apache.juli.FileHandler, java.util.logging.ConsoleHandler
+
+############################################################
+# Handler specific properties.
+# Describes specific configuration info for Handlers.
+############################################################
+
+1catalina.org.apache.juli.FileHandler.level = FINE
+1catalina.org.apache.juli.FileHandler.directory = ${httpfs.log.dir}
+1catalina.org.apache.juli.FileHandler.prefix = httpfs-catalina.
+
+2localhost.org.apache.juli.FileHandler.level = FINE
+2localhost.org.apache.juli.FileHandler.directory = ${httpfs.log.dir}
+2localhost.org.apache.juli.FileHandler.prefix = httpfs-localhost.
+
+3manager.org.apache.juli.FileHandler.level = FINE
+3manager.org.apache.juli.FileHandler.directory = ${httpfs.log.dir}
+3manager.org.apache.juli.FileHandler.prefix = httpfs-manager.
+
+4host-manager.org.apache.juli.FileHandler.level = FINE
+4host-manager.org.apache.juli.FileHandler.directory = ${httpfs.log.dir}
+4host-manager.org.apache.juli.FileHandler.prefix = httpfs-host-manager.
+
+java.util.logging.ConsoleHandler.level = FINE
+java.util.logging.ConsoleHandler.formatter = java.util.logging.SimpleFormatter
+
+
+############################################################
+# Facility specific properties.
+# Provides extra control for each logger.
+############################################################
+
+org.apache.catalina.core.ContainerBase.[Catalina].[localhost].level = INFO
+org.apache.catalina.core.ContainerBase.[Catalina].[localhost].handlers = 2localhost.org.apache.juli.FileHandler
+
+org.apache.catalina.core.ContainerBase.[Catalina].[localhost].[/manager].level = INFO
+org.apache.catalina.core.ContainerBase.[Catalina].[localhost].[/manager].handlers = 3manager.org.apache.juli.FileHandler
+
+org.apache.catalina.core.ContainerBase.[Catalina].[localhost].[/host-manager].level = INFO
+org.apache.catalina.core.ContainerBase.[Catalina].[localhost].[/host-manager].handlers = 4host-manager.org.apache.juli.FileHandler
+
+# For example, set the com.xyz.foo logger to only log SEVERE
+# messages:
+#org.apache.catalina.startup.ContextConfig.level = FINE
+#org.apache.catalina.startup.HostConfig.level = FINE
+#org.apache.catalina.session.ManagerBase.level = FINE
+#org.apache.catalina.core.AprLifecycleListener.level=FINE
diff --git a/x86_64/share/hadoop/httpfs/tomcat/conf/server.xml b/x86_64/share/hadoop/httpfs/tomcat/conf/server.xml
new file mode 100644
index 0000000..a425bdd
--- /dev/null
+++ b/x86_64/share/hadoop/httpfs/tomcat/conf/server.xml
@@ -0,0 +1,150 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/x86_64/share/hadoop/httpfs/tomcat/conf/tomcat-users.xml b/x86_64/share/hadoop/httpfs/tomcat/conf/tomcat-users.xml
new file mode 100644
index 0000000..7f022ff
--- /dev/null
+++ b/x86_64/share/hadoop/httpfs/tomcat/conf/tomcat-users.xml
@@ -0,0 +1,36 @@
+
+
+
+
+
+
+
diff --git a/x86_64/share/hadoop/httpfs/tomcat/conf/web.xml b/x86_64/share/hadoop/httpfs/tomcat/conf/web.xml
new file mode 100644
index 0000000..165bc7c
--- /dev/null
+++ b/x86_64/share/hadoop/httpfs/tomcat/conf/web.xml
@@ -0,0 +1,1249 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ default
+ org.apache.catalina.servlets.DefaultServlet
+
+ debug
+ 0
+
+
+ listings
+ false
+
+ 1
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ jsp
+ org.apache.jasper.servlet.JspServlet
+
+ fork
+ false
+
+
+ xpoweredBy
+ false
+
+ 3
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ default
+ /
+
+
+
+
+
+
+
+ jsp
+ *.jsp
+
+
+
+ jsp
+ *.jspx
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ 30
+
+
+
+
+
+
+
+
+
+
+
+ abs
+ audio/x-mpeg
+
+
+ ai
+ application/postscript
+
+
+ aif
+ audio/x-aiff
+
+
+ aifc
+ audio/x-aiff
+
+
+ aiff
+ audio/x-aiff
+
+
+ aim
+ application/x-aim
+
+
+ art
+ image/x-jg
+
+
+ asf
+ video/x-ms-asf
+
+
+ asx
+ video/x-ms-asf
+
+
+ au
+ audio/basic
+
+
+ avi
+ video/x-msvideo
+
+
+ avx
+ video/x-rad-screenplay
+
+
+ bcpio
+ application/x-bcpio
+
+
+ bin
+ application/octet-stream
+
+
+ bmp
+ image/bmp
+
+
+ body
+ text/html
+
+
+ cdf
+ application/x-cdf
+
+
+ cer
+ application/x-x509-ca-cert
+
+
+ class
+ application/java
+
+
+ cpio
+ application/x-cpio
+
+
+ csh
+ application/x-csh
+
+
+ css
+ text/css
+
+
+ dib
+ image/bmp
+
+
+ doc
+ application/msword
+
+
+ dtd
+ application/xml-dtd
+
+
+ dv
+ video/x-dv
+
+
+ dvi
+ application/x-dvi
+
+
+ eps
+ application/postscript
+
+
+ etx
+ text/x-setext
+
+
+ exe
+ application/octet-stream
+
+
+ gif
+ image/gif
+
+
+ gtar
+ application/x-gtar
+
+
+ gz
+ application/x-gzip
+
+
+ hdf
+ application/x-hdf
+
+
+ hqx
+ application/mac-binhex40
+
+
+ htc
+ text/x-component
+
+
+ htm
+ text/html
+
+
+ html
+ text/html
+
+
+ hqx
+ application/mac-binhex40
+
+
+ ief
+ image/ief
+
+
+ jad
+ text/vnd.sun.j2me.app-descriptor
+
+
+ jar
+ application/java-archive
+
+
+ java
+ text/plain
+
+
+ jnlp
+ application/x-java-jnlp-file
+
+
+ jpe
+ image/jpeg
+
+
+ jpeg
+ image/jpeg
+
+
+ jpg
+ image/jpeg
+
+
+ js
+ text/javascript
+
+
+ jsf
+ text/plain
+
+
+ jspf
+ text/plain
+
+
+ kar
+ audio/x-midi
+
+
+ latex
+ application/x-latex
+
+
+ m3u
+ audio/x-mpegurl
+
+
+ mac
+ image/x-macpaint
+
+
+ man
+ application/x-troff-man
+
+
+ mathml
+ application/mathml+xml
+
+
+ me
+ application/x-troff-me
+
+
+ mid
+ audio/x-midi
+
+
+ midi
+ audio/x-midi
+
+
+ mif
+ application/x-mif
+
+
+ mov
+ video/quicktime
+
+
+ movie
+ video/x-sgi-movie
+
+
+ mp1
+ audio/x-mpeg
+
+
+ mp2
+ audio/x-mpeg
+
+
+ mp3
+ audio/x-mpeg
+
+
+ mp4
+ video/mp4
+
+
+ mpa
+ audio/x-mpeg
+
+
+ mpe
+ video/mpeg
+
+
+ mpeg
+ video/mpeg
+
+
+ mpega
+ audio/x-mpeg
+
+
+ mpg
+ video/mpeg
+
+
+ mpv2
+ video/mpeg2
+
+
+ ms
+ application/x-wais-source
+
+
+ nc
+ application/x-netcdf
+
+
+ oda
+ application/oda
+
+
+
+ odb
+ application/vnd.oasis.opendocument.database
+
+
+
+ odc
+ application/vnd.oasis.opendocument.chart
+
+
+
+ odf
+ application/vnd.oasis.opendocument.formula
+
+
+
+ odg
+ application/vnd.oasis.opendocument.graphics
+
+
+
+ odi
+ application/vnd.oasis.opendocument.image
+
+
+
+ odm
+ application/vnd.oasis.opendocument.text-master
+
+
+
+ odp
+ application/vnd.oasis.opendocument.presentation
+
+
+
+ ods
+ application/vnd.oasis.opendocument.spreadsheet
+
+
+
+ odt
+ application/vnd.oasis.opendocument.text
+
+
+ ogg
+ application/ogg
+
+
+
+ otg
+ application/vnd.oasis.opendocument.graphics-template
+
+
+
+ oth
+ application/vnd.oasis.opendocument.text-web
+
+
+
+ otp
+ application/vnd.oasis.opendocument.presentation-template
+
+
+
+ ots
+ application/vnd.oasis.opendocument.spreadsheet-template
+
+
+
+ ott
+ application/vnd.oasis.opendocument.text-template
+
+
+ pbm
+ image/x-portable-bitmap
+
+
+ pct
+ image/pict
+
+
+ pdf
+ application/pdf
+
+
+ pgm
+ image/x-portable-graymap
+
+
+ pic
+ image/pict
+
+
+ pict
+ image/pict
+
+
+ pls
+ audio/x-scpls
+
+
+ png
+ image/png
+
+
+ pnm
+ image/x-portable-anymap
+
+
+ pnt
+ image/x-macpaint
+
+
+ ppm
+ image/x-portable-pixmap
+
+
+ ppt
+ application/vnd.ms-powerpoint
+
+
+ pps
+ application/vnd.ms-powerpoint
+
+
+ ps
+ application/postscript
+
+
+ psd
+ image/x-photoshop
+
+
+ qt
+ video/quicktime
+
+
+ qti
+ image/x-quicktime
+
+
+ qtif
+ image/x-quicktime
+
+
+ ras
+ image/x-cmu-raster
+
+
+ rdf
+ application/rdf+xml
+
+
+ rgb
+ image/x-rgb
+
+
+ rm
+ application/vnd.rn-realmedia
+
+
+ roff
+ application/x-troff
+
+
+ rtf
+ application/rtf
+
+
+ rtx
+ text/richtext
+
+
+ sh
+ application/x-sh
+
+
+ shar
+ application/x-shar
+
+
+
+ smf
+ audio/x-midi
+
+
+ sit
+ application/x-stuffit
+
+
+ snd
+ audio/basic
+
+
+ src
+ application/x-wais-source
+
+
+ sv4cpio
+ application/x-sv4cpio
+
+
+ sv4crc
+ application/x-sv4crc
+
+
+ svg
+ image/svg+xml
+
+
+ svgz
+ image/svg+xml
+
+
+ swf
+ application/x-shockwave-flash
+
+
+ t
+ application/x-troff
+
+
+ tar
+ application/x-tar
+
+
+ tcl
+ application/x-tcl
+
+
+ tex
+ application/x-tex
+
+
+ texi
+ application/x-texinfo
+
+
+ texinfo
+ application/x-texinfo
+
+
+ tif
+ image/tiff
+
+
+ tiff
+ image/tiff
+
+
+ tr
+ application/x-troff
+
+
+ tsv
+ text/tab-separated-values
+
+
+ txt
+ text/plain
+
+
+ ulw
+ audio/basic
+
+
+ ustar
+ application/x-ustar
+
+
+ vxml
+ application/voicexml+xml
+
+
+ xbm
+ image/x-xbitmap
+
+
+ xht
+ application/xhtml+xml
+
+
+ xhtml
+ application/xhtml+xml
+
+
+ xls
+ application/vnd.ms-excel
+
+
+ xml
+ application/xml
+
+
+ xpm
+ image/x-xpixmap
+
+
+ xsl
+ application/xml
+
+
+ xslt
+ application/xslt+xml
+
+
+ xul
+ application/vnd.mozilla.xul+xml
+
+
+ xwd
+ image/x-xwindowdump
+
+
+ vsd
+ application/x-visio
+
+
+ wav
+ audio/x-wav
+
+
+
+ wbmp
+ image/vnd.wap.wbmp
+
+
+
+ wml
+ text/vnd.wap.wml
+
+
+
+ wmlc
+ application/vnd.wap.wmlc
+
+
+
+ wmls
+ text/vnd.wap.wmlscript
+
+
+
+ wmlscriptc
+ application/vnd.wap.wmlscriptc
+
+
+ wmv
+ video/x-ms-wmv
+
+
+ wrl
+ x-world/x-vrml
+
+
+ wspolicy
+ application/wspolicy+xml
+
+
+ Z
+ application/x-compress
+
+
+ z
+ application/x-compress
+
+
+ zip
+ application/zip
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ index.html
+ index.htm
+ index.jsp
+
+
+
diff --git a/x86_64/share/hadoop/httpfs/tomcat/lib/annotations-api.jar b/x86_64/share/hadoop/httpfs/tomcat/lib/annotations-api.jar
new file mode 100644
index 0000000..8d5d8f9
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/lib/annotations-api.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/lib/catalina-ant.jar b/x86_64/share/hadoop/httpfs/tomcat/lib/catalina-ant.jar
new file mode 100644
index 0000000..4646554
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/lib/catalina-ant.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/lib/catalina-ha.jar b/x86_64/share/hadoop/httpfs/tomcat/lib/catalina-ha.jar
new file mode 100644
index 0000000..440291d
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/lib/catalina-ha.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/lib/catalina-tribes.jar b/x86_64/share/hadoop/httpfs/tomcat/lib/catalina-tribes.jar
new file mode 100644
index 0000000..7ea726f
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/lib/catalina-tribes.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/lib/catalina.jar b/x86_64/share/hadoop/httpfs/tomcat/lib/catalina.jar
new file mode 100644
index 0000000..f18c20d
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/lib/catalina.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/lib/ecj-3.7.2.jar b/x86_64/share/hadoop/httpfs/tomcat/lib/ecj-3.7.2.jar
new file mode 100644
index 0000000..54ae692
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/lib/ecj-3.7.2.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/lib/el-api.jar b/x86_64/share/hadoop/httpfs/tomcat/lib/el-api.jar
new file mode 100644
index 0000000..7503cda
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/lib/el-api.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/lib/jasper-el.jar b/x86_64/share/hadoop/httpfs/tomcat/lib/jasper-el.jar
new file mode 100644
index 0000000..c51e275
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/lib/jasper-el.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/lib/jasper.jar b/x86_64/share/hadoop/httpfs/tomcat/lib/jasper.jar
new file mode 100644
index 0000000..4728407
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/lib/jasper.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/lib/jsp-api.jar b/x86_64/share/hadoop/httpfs/tomcat/lib/jsp-api.jar
new file mode 100644
index 0000000..3030459
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/lib/jsp-api.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/lib/servlet-api.jar b/x86_64/share/hadoop/httpfs/tomcat/lib/servlet-api.jar
new file mode 100644
index 0000000..44f490c
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/lib/servlet-api.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/lib/tomcat-coyote.jar b/x86_64/share/hadoop/httpfs/tomcat/lib/tomcat-coyote.jar
new file mode 100644
index 0000000..e0cb2be
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/lib/tomcat-coyote.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/lib/tomcat-dbcp.jar b/x86_64/share/hadoop/httpfs/tomcat/lib/tomcat-dbcp.jar
new file mode 100644
index 0000000..8f85010
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/lib/tomcat-dbcp.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/lib/tomcat-i18n-es.jar b/x86_64/share/hadoop/httpfs/tomcat/lib/tomcat-i18n-es.jar
new file mode 100644
index 0000000..3871d47
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/lib/tomcat-i18n-es.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/lib/tomcat-i18n-fr.jar b/x86_64/share/hadoop/httpfs/tomcat/lib/tomcat-i18n-fr.jar
new file mode 100644
index 0000000..8991f2a
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/lib/tomcat-i18n-fr.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/lib/tomcat-i18n-ja.jar b/x86_64/share/hadoop/httpfs/tomcat/lib/tomcat-i18n-ja.jar
new file mode 100644
index 0000000..6c7f787
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/lib/tomcat-i18n-ja.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/temp/safeToDelete.tmp b/x86_64/share/hadoop/httpfs/tomcat/temp/safeToDelete.tmp
new file mode 100644
index 0000000..e69de29
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/ROOT/WEB-INF/web.xml b/x86_64/share/hadoop/httpfs/tomcat/webapps/ROOT/WEB-INF/web.xml
new file mode 100644
index 0000000..9d0ae0d
--- /dev/null
+++ b/x86_64/share/hadoop/httpfs/tomcat/webapps/ROOT/WEB-INF/web.xml
@@ -0,0 +1,16 @@
+
+
+
+
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/ROOT/index.html b/x86_64/share/hadoop/httpfs/tomcat/webapps/ROOT/index.html
new file mode 100644
index 0000000..2f9aa7a
--- /dev/null
+++ b/x86_64/share/hadoop/httpfs/tomcat/webapps/ROOT/index.html
@@ -0,0 +1,21 @@
+
+
+
+
+HttpFs service, service base URL at /webhdfs/v1.
+
+
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/default-log4j.properties b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/default-log4j.properties
new file mode 100644
index 0000000..a0c6527
--- /dev/null
+++ b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/default-log4j.properties
@@ -0,0 +1,20 @@
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+log4j.appender.console=org.apache.log4j.ConsoleAppender
+log4j.appender.console.Target=System.err
+log4j.appender.console.layout=org.apache.log4j.PatternLayout
+log4j.appender.console.layout.ConversionPattern=%d{ABSOLUTE} %5p %c{1}:%L - %m%n
+log4j.rootLogger=INFO, console
+
+
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/httpfs-default.xml b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/httpfs-default.xml
new file mode 100644
index 0000000..87cd730
--- /dev/null
+++ b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/httpfs-default.xml
@@ -0,0 +1,237 @@
+
+
+
+
+
+
+
+
+ httpfs.buffer.size
+ 4096
+
+ The buffer size used by a read/write request when streaming data from/to
+ HDFS.
+
+
+
+
+
+
+ httpfs.services
+
+ org.apache.hadoop.lib.service.instrumentation.InstrumentationService,
+ org.apache.hadoop.lib.service.scheduler.SchedulerService,
+ org.apache.hadoop.lib.service.security.GroupsService,
+ org.apache.hadoop.lib.service.security.ProxyUserService,
+ org.apache.hadoop.lib.service.security.DelegationTokenManagerService,
+ org.apache.hadoop.lib.service.hadoop.FileSystemAccessService
+
+
+ Services used by the httpfs server.
+
+
+
+
+
+
+ kerberos.realm
+ LOCALHOST
+
+ Kerberos realm, used only if Kerberos authentication is used between
+ the clients and httpfs or between HttpFS and HDFS.
+
+ This property is only used to resolve other properties within this
+ configuration file.
+
+
+
+
+
+
+ httpfs.hostname
+ ${httpfs.http.hostname}
+
+ Property used to synthetize the HTTP Kerberos principal used by httpfs.
+
+ This property is only used to resolve other properties within this
+ configuration file.
+
+
+
+
+ httpfs.authentication.signature.secret.file
+ ${httpfs.config.dir}/httpfs-signature.secret
+
+ File containing the secret to sign HttpFS hadoop-auth cookies.
+
+ This file should be readable only by the system user running HttpFS service.
+
+ If multiple HttpFS servers are used in a load-balancer/round-robin fashion,
+ they should share the secret file.
+
+
+
+
+ httpfs.authentication.type
+ simple
+
+ Defines the authentication mechanism used by httpfs for its HTTP clients.
+
+ Valid values are 'simple' or 'kerberos'.
+
+ If using 'simple' HTTP clients must specify the username with the
+ 'user.name' query string parameter.
+
+ If using 'kerberos' HTTP clients must use HTTP SPNEGO or delegation tokens.
+
+
+
+
+ httpfs.authentication.kerberos.principal
+ HTTP/${httpfs.hostname}@${kerberos.realm}
+
+ The HTTP Kerberos principal used by HttpFS in the HTTP endpoint.
+
+ The HTTP Kerberos principal MUST start with 'HTTP/' per Kerberos
+ HTTP SPNEGO specification.
+
+
+
+
+ httpfs.authentication.kerberos.keytab
+ ${user.home}/httpfs.keytab
+
+ The Kerberos keytab file with the credentials for the
+ HTTP Kerberos principal used by httpfs in the HTTP endpoint.
+
+
+
+
+
+
+ httpfs.proxyuser.#USER#.hosts
+ *
+
+ List of hosts the '#USER#' user is allowed to perform 'doAs'
+ operations.
+
+ The '#USER#' must be replaced with the username o the user who is
+ allowed to perform 'doAs' operations.
+
+ The value can be the '*' wildcard or a list of hostnames.
+
+ For multiple users copy this property and replace the user name
+ in the property name.
+
+
+
+
+ httpfs.proxyuser.#USER#.groups
+ *
+
+ List of groups the '#USER#' user is allowed to impersonate users
+ from to perform 'doAs' operations.
+
+ The '#USER#' must be replaced with the username o the user who is
+ allowed to perform 'doAs' operations.
+
+ The value can be the '*' wildcard or a list of groups.
+
+ For multiple users copy this property and replace the user name
+ in the property name.
+
+
+
+
+
+
+ httpfs.delegation.token.manager.update.interval
+ 86400
+
+ HttpFS delegation token update interval, default 1 day, in seconds.
+
+
+
+
+ httpfs.delegation.token.manager.max.lifetime
+ 604800
+
+ HttpFS delegation token maximum lifetime, default 7 days, in seconds
+
+
+
+
+ httpfs.delegation.token.manager.renewal.interval
+ 86400
+
+ HttpFS delegation token update interval, default 1 day, in seconds.
+
+
+
+
+
+
+ httpfs.hadoop.authentication.type
+ simple
+
+ Defines the authentication mechanism used by httpfs to connect to
+ the HDFS Namenode.
+
+ Valid values are 'simple' and 'kerberos'.
+
+
+
+
+ httpfs.hadoop.authentication.kerberos.keytab
+ ${user.home}/httpfs.keytab
+
+ The Kerberos keytab file with the credentials for the
+ Kerberos principal used by httpfs to connect to the HDFS Namenode.
+
+
+
+
+ httpfs.hadoop.authentication.kerberos.principal
+ ${user.name}/${httpfs.hostname}@${kerberos.realm}
+
+ The Kerberos principal used by httpfs to connect to the HDFS Namenode.
+
+
+
+
+ httpfs.hadoop.filesystem.cache.purge.frequency
+ 60
+
+ Frequency, in seconds, for the idle filesystem purging daemon runs.
+
+
+
+
+ httpfs.hadoop.filesystem.cache.purge.timeout
+ 60
+
+ Timeout, in seconds, for an idle filesystem to be purged.
+
+
+
+
+ httpfs.user.provider.user.pattern
+ ^[A-Za-z_][A-Za-z0-9._-]*[$]?$
+
+ Valid pattern for user and group names, it must be a valid java regex.
+
+
+
+
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/httpfs.properties b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/httpfs.properties
new file mode 100644
index 0000000..fa34e54
--- /dev/null
+++ b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/httpfs.properties
@@ -0,0 +1,21 @@
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+httpfs.version=2.2.0
+
+httpfs.source.repository=REPO NOT AVAIL
+httpfs.source.revision=REVISION NOT AVAIL
+
+httpfs.build.username=andrew.mcdermott
+httpfs.build.timestamp=2014-02-12T17:39:56+0000
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSFileSystem$1.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSFileSystem$1.class
new file mode 100644
index 0000000..49abc0c
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSFileSystem$1.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSFileSystem$2.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSFileSystem$2.class
new file mode 100644
index 0000000..9989a85
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSFileSystem$2.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSFileSystem$3.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSFileSystem$3.class
new file mode 100644
index 0000000..af936a1
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSFileSystem$3.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSFileSystem$4.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSFileSystem$4.class
new file mode 100644
index 0000000..1d84673
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSFileSystem$4.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSFileSystem$5.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSFileSystem$5.class
new file mode 100644
index 0000000..f5cb24f
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSFileSystem$5.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSFileSystem$6.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSFileSystem$6.class
new file mode 100644
index 0000000..e974769
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSFileSystem$6.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSFileSystem$FILE_TYPE.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSFileSystem$FILE_TYPE.class
new file mode 100644
index 0000000..b976dbc
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSFileSystem$FILE_TYPE.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSFileSystem$HttpFSDataInputStream.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSFileSystem$HttpFSDataInputStream.class
new file mode 100644
index 0000000..4adcb6a
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSFileSystem$HttpFSDataInputStream.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSFileSystem$HttpFSDataOutputStream.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSFileSystem$HttpFSDataOutputStream.class
new file mode 100644
index 0000000..dfb08d9
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSFileSystem$HttpFSDataOutputStream.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSFileSystem$Operation.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSFileSystem$Operation.class
new file mode 100644
index 0000000..aa5611a
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSFileSystem$Operation.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSFileSystem.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSFileSystem.class
new file mode 100644
index 0000000..f93e70d
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSFileSystem.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSKerberosAuthenticator$DelegationTokenOperation.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSKerberosAuthenticator$DelegationTokenOperation.class
new file mode 100644
index 0000000..b689700
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSKerberosAuthenticator$DelegationTokenOperation.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSKerberosAuthenticator.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSKerberosAuthenticator.class
new file mode 100644
index 0000000..269cc20
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSKerberosAuthenticator.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSPseudoAuthenticator.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSPseudoAuthenticator.class
new file mode 100644
index 0000000..ac5cf2c
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSPseudoAuthenticator.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSUtils.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSUtils.class
new file mode 100644
index 0000000..044305e
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/client/HttpFSUtils.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/CheckUploadContentTypeFilter.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/CheckUploadContentTypeFilter.class
new file mode 100644
index 0000000..740000b
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/CheckUploadContentTypeFilter.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSAppend.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSAppend.class
new file mode 100644
index 0000000..d7c57ad
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSAppend.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSConcat.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSConcat.class
new file mode 100644
index 0000000..327e9da
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSConcat.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSContentSummary.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSContentSummary.class
new file mode 100644
index 0000000..1b3cbbf
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSContentSummary.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSCreate.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSCreate.class
new file mode 100644
index 0000000..7b01b41
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSCreate.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSDelete.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSDelete.class
new file mode 100644
index 0000000..6e2bfe7
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSDelete.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSFileChecksum.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSFileChecksum.class
new file mode 100644
index 0000000..6c69e93
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSFileChecksum.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSFileStatus.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSFileStatus.class
new file mode 100644
index 0000000..750e2e0
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSFileStatus.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSHomeDir.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSHomeDir.class
new file mode 100644
index 0000000..dde4e76
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSHomeDir.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSListStatus.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSListStatus.class
new file mode 100644
index 0000000..25c2a48
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSListStatus.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSMkdirs.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSMkdirs.class
new file mode 100644
index 0000000..4d74c68
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSMkdirs.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSOpen.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSOpen.class
new file mode 100644
index 0000000..57921ff
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSOpen.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSRename.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSRename.class
new file mode 100644
index 0000000..5c6a7c9
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSRename.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSSetOwner.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSSetOwner.class
new file mode 100644
index 0000000..ac33241
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSSetOwner.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSSetPermission.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSSetPermission.class
new file mode 100644
index 0000000..58cba4f
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSSetPermission.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSSetReplication.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSSetReplication.class
new file mode 100644
index 0000000..5247ff7
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSSetReplication.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSSetTimes.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSSetTimes.class
new file mode 100644
index 0000000..8165ce1
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations$FSSetTimes.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations.class
new file mode 100644
index 0000000..b3ee73c
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/FSOperations.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSAuthenticationFilter.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSAuthenticationFilter.class
new file mode 100644
index 0000000..c74583a
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSAuthenticationFilter.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSExceptionProvider.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSExceptionProvider.class
new file mode 100644
index 0000000..8ac59d0
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSExceptionProvider.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSKerberosAuthenticationHandler$1.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSKerberosAuthenticationHandler$1.class
new file mode 100644
index 0000000..7d52614
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSKerberosAuthenticationHandler$1.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSKerberosAuthenticationHandler.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSKerberosAuthenticationHandler.class
new file mode 100644
index 0000000..5f43883
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSKerberosAuthenticationHandler.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$AccessTimeParam.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$AccessTimeParam.class
new file mode 100644
index 0000000..9041b77
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$AccessTimeParam.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$BlockSizeParam.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$BlockSizeParam.class
new file mode 100644
index 0000000..6d96d40
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$BlockSizeParam.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$DataParam.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$DataParam.class
new file mode 100644
index 0000000..7be8cfe
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$DataParam.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$DestinationParam.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$DestinationParam.class
new file mode 100644
index 0000000..80efa24
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$DestinationParam.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$DoAsParam.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$DoAsParam.class
new file mode 100644
index 0000000..85b1be8
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$DoAsParam.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$FilterParam.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$FilterParam.class
new file mode 100644
index 0000000..9ad3d25
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$FilterParam.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$GroupParam.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$GroupParam.class
new file mode 100644
index 0000000..c336620
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$GroupParam.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$LenParam.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$LenParam.class
new file mode 100644
index 0000000..2cb4327
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$LenParam.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$ModifiedTimeParam.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$ModifiedTimeParam.class
new file mode 100644
index 0000000..ff5ca61
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$ModifiedTimeParam.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$OffsetParam.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$OffsetParam.class
new file mode 100644
index 0000000..bb9ad38
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$OffsetParam.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$OperationParam.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$OperationParam.class
new file mode 100644
index 0000000..0361a57
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$OperationParam.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$OverwriteParam.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$OverwriteParam.class
new file mode 100644
index 0000000..1123645
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$OverwriteParam.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$OwnerParam.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$OwnerParam.class
new file mode 100644
index 0000000..fe2d7f6
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$OwnerParam.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$PermissionParam.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$PermissionParam.class
new file mode 100644
index 0000000..7d44c19
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$PermissionParam.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$RecursiveParam.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$RecursiveParam.class
new file mode 100644
index 0000000..d8140e8
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$RecursiveParam.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$ReplicationParam.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$ReplicationParam.class
new file mode 100644
index 0000000..8e74641
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$ReplicationParam.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$SourcesParam.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$SourcesParam.class
new file mode 100644
index 0000000..d97c534
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider$SourcesParam.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider.class
new file mode 100644
index 0000000..93058c9
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSParametersProvider.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSReleaseFilter.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSReleaseFilter.class
new file mode 100644
index 0000000..9e97913
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSReleaseFilter.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSServer$1.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSServer$1.class
new file mode 100644
index 0000000..0338f6d
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSServer$1.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSServer.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSServer.class
new file mode 100644
index 0000000..435e144
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSServer.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSServerWebApp.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSServerWebApp.class
new file mode 100644
index 0000000..3b1fa0f
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/fs/http/server/HttpFSServerWebApp.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/lang/RunnableCallable.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/lang/RunnableCallable.class
new file mode 100644
index 0000000..d4c9d68
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/lang/RunnableCallable.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/lang/XException$ERROR.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/lang/XException$ERROR.class
new file mode 100644
index 0000000..17ab56d
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/lang/XException$ERROR.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/lang/XException.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/lang/XException.class
new file mode 100644
index 0000000..c9d7305
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/lang/XException.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/server/BaseService.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/server/BaseService.class
new file mode 100644
index 0000000..e27aa3c
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/server/BaseService.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/server/Server$Status.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/server/Server$Status.class
new file mode 100644
index 0000000..05c4788
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/server/Server$Status.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/server/Server.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/server/Server.class
new file mode 100644
index 0000000..7819c2f
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/server/Server.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/server/ServerException$ERROR.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/server/ServerException$ERROR.class
new file mode 100644
index 0000000..77be81c
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/server/ServerException$ERROR.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/server/ServerException.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/server/ServerException.class
new file mode 100644
index 0000000..f2393ca
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/server/ServerException.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/server/Service.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/server/Service.class
new file mode 100644
index 0000000..df904d0
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/server/Service.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/server/ServiceException.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/server/ServiceException.class
new file mode 100644
index 0000000..4708dba
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/server/ServiceException.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/DelegationTokenIdentifier.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/DelegationTokenIdentifier.class
new file mode 100644
index 0000000..4eb511b
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/DelegationTokenIdentifier.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/DelegationTokenManager.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/DelegationTokenManager.class
new file mode 100644
index 0000000..715cab3
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/DelegationTokenManager.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/DelegationTokenManagerException$ERROR.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/DelegationTokenManagerException$ERROR.class
new file mode 100644
index 0000000..5ccfff8
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/DelegationTokenManagerException$ERROR.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/DelegationTokenManagerException.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/DelegationTokenManagerException.class
new file mode 100644
index 0000000..5992d5a
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/DelegationTokenManagerException.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/FileSystemAccess$FileSystemExecutor.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/FileSystemAccess$FileSystemExecutor.class
new file mode 100644
index 0000000..cba6a24
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/FileSystemAccess$FileSystemExecutor.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/FileSystemAccess.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/FileSystemAccess.class
new file mode 100644
index 0000000..1c60b44
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/FileSystemAccess.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/FileSystemAccessException$ERROR.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/FileSystemAccessException$ERROR.class
new file mode 100644
index 0000000..d14e92e
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/FileSystemAccessException$ERROR.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/FileSystemAccessException.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/FileSystemAccessException.class
new file mode 100644
index 0000000..636ad09
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/FileSystemAccessException.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/Groups.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/Groups.class
new file mode 100644
index 0000000..3eb303a
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/Groups.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/Instrumentation$Cron.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/Instrumentation$Cron.class
new file mode 100644
index 0000000..fa84e4a
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/Instrumentation$Cron.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/Instrumentation$Variable.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/Instrumentation$Variable.class
new file mode 100644
index 0000000..2bc1179
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/Instrumentation$Variable.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/Instrumentation.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/Instrumentation.class
new file mode 100644
index 0000000..862cecc
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/Instrumentation.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/ProxyUser.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/ProxyUser.class
new file mode 100644
index 0000000..7cfc8b4
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/ProxyUser.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/Scheduler.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/Scheduler.class
new file mode 100644
index 0000000..1861e56
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/Scheduler.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/hadoop/FileSystemAccessService$1.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/hadoop/FileSystemAccessService$1.class
new file mode 100644
index 0000000..dc35f9c
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/hadoop/FileSystemAccessService$1.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/hadoop/FileSystemAccessService$2.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/hadoop/FileSystemAccessService$2.class
new file mode 100644
index 0000000..6ae37e2
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/hadoop/FileSystemAccessService$2.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/hadoop/FileSystemAccessService$3.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/hadoop/FileSystemAccessService$3.class
new file mode 100644
index 0000000..8045c21
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/hadoop/FileSystemAccessService$3.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/hadoop/FileSystemAccessService$4.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/hadoop/FileSystemAccessService$4.class
new file mode 100644
index 0000000..285d621
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/hadoop/FileSystemAccessService$4.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/hadoop/FileSystemAccessService$CachedFileSystem.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/hadoop/FileSystemAccessService$CachedFileSystem.class
new file mode 100644
index 0000000..67dfe52
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/hadoop/FileSystemAccessService$CachedFileSystem.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/hadoop/FileSystemAccessService$FileSystemCachePurger.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/hadoop/FileSystemAccessService$FileSystemCachePurger.class
new file mode 100644
index 0000000..dd75663
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/hadoop/FileSystemAccessService$FileSystemCachePurger.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/hadoop/FileSystemAccessService.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/hadoop/FileSystemAccessService.class
new file mode 100644
index 0000000..a354230
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/hadoop/FileSystemAccessService.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/instrumentation/InstrumentationService$1.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/instrumentation/InstrumentationService$1.class
new file mode 100644
index 0000000..dbc269c
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/instrumentation/InstrumentationService$1.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/instrumentation/InstrumentationService$2.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/instrumentation/InstrumentationService$2.class
new file mode 100644
index 0000000..5665910
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/instrumentation/InstrumentationService$2.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/instrumentation/InstrumentationService$3.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/instrumentation/InstrumentationService$3.class
new file mode 100644
index 0000000..702e047
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/instrumentation/InstrumentationService$3.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/instrumentation/InstrumentationService$Cron.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/instrumentation/InstrumentationService$Cron.class
new file mode 100644
index 0000000..2d17a5e
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/instrumentation/InstrumentationService$Cron.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/instrumentation/InstrumentationService$Sampler.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/instrumentation/InstrumentationService$Sampler.class
new file mode 100644
index 0000000..bdb2f70
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/instrumentation/InstrumentationService$Sampler.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/instrumentation/InstrumentationService$SamplersRunnable.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/instrumentation/InstrumentationService$SamplersRunnable.class
new file mode 100644
index 0000000..fc6b658
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/instrumentation/InstrumentationService$SamplersRunnable.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/instrumentation/InstrumentationService$Timer.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/instrumentation/InstrumentationService$Timer.class
new file mode 100644
index 0000000..26b4b2e
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/instrumentation/InstrumentationService$Timer.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/instrumentation/InstrumentationService$VariableHolder.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/instrumentation/InstrumentationService$VariableHolder.class
new file mode 100644
index 0000000..9ccef87
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/instrumentation/InstrumentationService$VariableHolder.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/instrumentation/InstrumentationService.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/instrumentation/InstrumentationService.class
new file mode 100644
index 0000000..1e1c8ce
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/instrumentation/InstrumentationService.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/scheduler/SchedulerService$1.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/scheduler/SchedulerService$1.class
new file mode 100644
index 0000000..9b2aaf7
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/scheduler/SchedulerService$1.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/scheduler/SchedulerService.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/scheduler/SchedulerService.class
new file mode 100644
index 0000000..ec12d9d
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/scheduler/SchedulerService.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/security/DelegationTokenManagerService$DelegationTokenSecretManager.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/security/DelegationTokenManagerService$DelegationTokenSecretManager.class
new file mode 100644
index 0000000..5f923ec
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/security/DelegationTokenManagerService$DelegationTokenSecretManager.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/security/DelegationTokenManagerService.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/security/DelegationTokenManagerService.class
new file mode 100644
index 0000000..d1adcad
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/security/DelegationTokenManagerService.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/security/GroupsService.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/security/GroupsService.class
new file mode 100644
index 0000000..3d86f54
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/security/GroupsService.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/security/ProxyUserService$ERROR.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/security/ProxyUserService$ERROR.class
new file mode 100644
index 0000000..a6fb583
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/security/ProxyUserService$ERROR.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/security/ProxyUserService.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/security/ProxyUserService.class
new file mode 100644
index 0000000..09fd9f3
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/service/security/ProxyUserService.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/servlet/FileSystemReleaseFilter.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/servlet/FileSystemReleaseFilter.class
new file mode 100644
index 0000000..2db5ff2
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/servlet/FileSystemReleaseFilter.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/servlet/HostnameFilter.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/servlet/HostnameFilter.class
new file mode 100644
index 0000000..678775d
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/servlet/HostnameFilter.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/servlet/MDCFilter.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/servlet/MDCFilter.class
new file mode 100644
index 0000000..f42eda5
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/servlet/MDCFilter.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/servlet/ServerWebApp.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/servlet/ServerWebApp.class
new file mode 100644
index 0000000..e3c7404
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/servlet/ServerWebApp.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/util/Check.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/util/Check.class
new file mode 100644
index 0000000..7932eb5
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/util/Check.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/util/ConfigurationUtils.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/util/ConfigurationUtils.class
new file mode 100644
index 0000000..b88668b
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/util/ConfigurationUtils.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/BooleanParam.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/BooleanParam.class
new file mode 100644
index 0000000..0619b81
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/BooleanParam.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/ByteParam.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/ByteParam.class
new file mode 100644
index 0000000..a86448b
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/ByteParam.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/EnumParam.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/EnumParam.class
new file mode 100644
index 0000000..99f1eaf
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/EnumParam.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/ExceptionProvider.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/ExceptionProvider.class
new file mode 100644
index 0000000..b0c2693
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/ExceptionProvider.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/InputStreamEntity.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/InputStreamEntity.class
new file mode 100644
index 0000000..34eb9ca
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/InputStreamEntity.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/IntegerParam.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/IntegerParam.class
new file mode 100644
index 0000000..0a0ec02
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/IntegerParam.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/JSONMapProvider.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/JSONMapProvider.class
new file mode 100644
index 0000000..f6a1c40
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/JSONMapProvider.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/JSONProvider.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/JSONProvider.class
new file mode 100644
index 0000000..248f53b
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/JSONProvider.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/LongParam.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/LongParam.class
new file mode 100644
index 0000000..f5a6ba3
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/LongParam.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/Param.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/Param.class
new file mode 100644
index 0000000..414ab9c
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/Param.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/Parameters.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/Parameters.class
new file mode 100644
index 0000000..1d5fb1f
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/Parameters.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/ParametersProvider.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/ParametersProvider.class
new file mode 100644
index 0000000..15b80b3
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/ParametersProvider.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/ShortParam.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/ShortParam.class
new file mode 100644
index 0000000..386c4d9
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/ShortParam.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/StringParam.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/StringParam.class
new file mode 100644
index 0000000..4aeda70
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/StringParam.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/UserProvider$1.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/UserProvider$1.class
new file mode 100644
index 0000000..aceb693
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/UserProvider$1.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/UserProvider$UserParam.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/UserProvider$UserParam.class
new file mode 100644
index 0000000..8d2a019
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/UserProvider$UserParam.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/UserProvider.class b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/UserProvider.class
new file mode 100644
index 0000000..a85e24b
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/classes/org/apache/hadoop/lib/wsrs/UserProvider.class differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/activation-1.1.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/activation-1.1.jar
new file mode 100644
index 0000000..53f82a1
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/activation-1.1.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/asm-3.2.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/asm-3.2.jar
new file mode 100644
index 0000000..ca9f8d2
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/asm-3.2.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/avro-1.7.4.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/avro-1.7.4.jar
new file mode 100644
index 0000000..69dd87d
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/avro-1.7.4.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-beanutils-1.7.0.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-beanutils-1.7.0.jar
new file mode 100644
index 0000000..b1b89c9
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-beanutils-1.7.0.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-beanutils-core-1.8.0.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-beanutils-core-1.8.0.jar
new file mode 100644
index 0000000..87c15f4
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-beanutils-core-1.8.0.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-cli-1.2.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-cli-1.2.jar
new file mode 100644
index 0000000..ce4b9ff
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-cli-1.2.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-codec-1.4.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-codec-1.4.jar
new file mode 100644
index 0000000..458d432
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-codec-1.4.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-collections-3.2.1.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-collections-3.2.1.jar
new file mode 100644
index 0000000..c35fa1f
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-collections-3.2.1.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-compress-1.4.1.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-compress-1.4.1.jar
new file mode 100644
index 0000000..b58761e
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-compress-1.4.1.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-configuration-1.6.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-configuration-1.6.jar
new file mode 100644
index 0000000..2d4689a
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-configuration-1.6.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-daemon-1.0.13.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-daemon-1.0.13.jar
new file mode 100644
index 0000000..ac77321
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-daemon-1.0.13.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-digester-1.8.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-digester-1.8.jar
new file mode 100644
index 0000000..1110f0a
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-digester-1.8.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-io-2.1.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-io-2.1.jar
new file mode 100644
index 0000000..b5c7d69
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-io-2.1.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-lang-2.5.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-lang-2.5.jar
new file mode 100644
index 0000000..ae491da
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-lang-2.5.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-logging-1.1.1.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-logging-1.1.1.jar
new file mode 100644
index 0000000..1deef14
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-logging-1.1.1.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-math-2.1.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-math-2.1.jar
new file mode 100644
index 0000000..43b4b36
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-math-2.1.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-net-3.1.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-net-3.1.jar
new file mode 100644
index 0000000..b75f1a5
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-net-3.1.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/guava-11.0.2.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/guava-11.0.2.jar
new file mode 100644
index 0000000..c8c8d5d
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/guava-11.0.2.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/hadoop-annotations-2.2.0.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/hadoop-annotations-2.2.0.jar
new file mode 100644
index 0000000..6799af2
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/hadoop-annotations-2.2.0.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/hadoop-auth-2.2.0.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/hadoop-auth-2.2.0.jar
new file mode 100644
index 0000000..88d3e15
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/hadoop-auth-2.2.0.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/hadoop-common-2.2.0.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/hadoop-common-2.2.0.jar
new file mode 100644
index 0000000..89dbcea
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/hadoop-common-2.2.0.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/hadoop-hdfs-2.2.0.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/hadoop-hdfs-2.2.0.jar
new file mode 100644
index 0000000..9e01d43
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/hadoop-hdfs-2.2.0.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jackson-core-asl-1.8.8.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jackson-core-asl-1.8.8.jar
new file mode 100644
index 0000000..05f3353
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jackson-core-asl-1.8.8.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jackson-jaxrs-1.8.8.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jackson-jaxrs-1.8.8.jar
new file mode 100644
index 0000000..21b31c2
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jackson-jaxrs-1.8.8.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jackson-mapper-asl-1.8.8.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jackson-mapper-asl-1.8.8.jar
new file mode 100644
index 0000000..7c7cd21
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jackson-mapper-asl-1.8.8.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jackson-xc-1.8.8.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jackson-xc-1.8.8.jar
new file mode 100644
index 0000000..ebfbf41
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jackson-xc-1.8.8.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jaxb-api-2.2.2.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jaxb-api-2.2.2.jar
new file mode 100644
index 0000000..31e5fa0
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jaxb-api-2.2.2.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jaxb-impl-2.2.3-1.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jaxb-impl-2.2.3-1.jar
new file mode 100644
index 0000000..eeaf660
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jaxb-impl-2.2.3-1.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jersey-core-1.9.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jersey-core-1.9.jar
new file mode 100644
index 0000000..548dd88
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jersey-core-1.9.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jersey-json-1.9.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jersey-json-1.9.jar
new file mode 100644
index 0000000..b1a4ce5
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jersey-json-1.9.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jersey-server-1.9.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jersey-server-1.9.jar
new file mode 100644
index 0000000..ae0117c
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jersey-server-1.9.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jettison-1.1.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jettison-1.1.jar
new file mode 100644
index 0000000..e4e9c8c
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jettison-1.1.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jsch-0.1.42.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jsch-0.1.42.jar
new file mode 100644
index 0000000..c65eff0
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jsch-0.1.42.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/json-simple-1.1.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/json-simple-1.1.jar
new file mode 100644
index 0000000..f395f41
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/json-simple-1.1.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jsr305-1.3.9.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jsr305-1.3.9.jar
new file mode 100644
index 0000000..a9afc66
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jsr305-1.3.9.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/log4j-1.2.17.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/log4j-1.2.17.jar
new file mode 100644
index 0000000..1d425cf
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/log4j-1.2.17.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/paranamer-2.3.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/paranamer-2.3.jar
new file mode 100644
index 0000000..ad12ae9
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/paranamer-2.3.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/protobuf-java-2.5.0.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/protobuf-java-2.5.0.jar
new file mode 100644
index 0000000..4c4e686
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/protobuf-java-2.5.0.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/slf4j-api-1.7.5.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/slf4j-api-1.7.5.jar
new file mode 100644
index 0000000..8f004d3
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/slf4j-api-1.7.5.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/slf4j-log4j12-1.7.5.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/slf4j-log4j12-1.7.5.jar
new file mode 100644
index 0000000..f5298b5
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/slf4j-log4j12-1.7.5.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/snappy-java-1.0.4.1.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/snappy-java-1.0.4.1.jar
new file mode 100644
index 0000000..8198919
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/snappy-java-1.0.4.1.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/stax-api-1.0.1.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/stax-api-1.0.1.jar
new file mode 100644
index 0000000..d9a1665
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/stax-api-1.0.1.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/xmlenc-0.52.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/xmlenc-0.52.jar
new file mode 100644
index 0000000..ec568b4
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/xmlenc-0.52.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/xz-1.0.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/xz-1.0.jar
new file mode 100644
index 0000000..a848f16
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/xz-1.0.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/zookeeper-3.4.5.jar b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/zookeeper-3.4.5.jar
new file mode 100644
index 0000000..a7966bb
Binary files /dev/null and b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/zookeeper-3.4.5.jar differ
diff --git a/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/web.xml b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/web.xml
new file mode 100644
index 0000000..4c0b3ae
--- /dev/null
+++ b/x86_64/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/web.xml
@@ -0,0 +1,98 @@
+
+
+
+
+
+ org.apache.hadoop.fs.http.server.HttpFSServerWebApp
+
+
+
+ webservices-driver
+ com.sun.jersey.spi.container.servlet.ServletContainer
+
+ com.sun.jersey.config.property.packages
+ org.apache.hadoop.fs.http.server,org.apache.hadoop.lib.wsrs
+
+
+
+
+ 1
+
+
+
+ webservices-driver
+ /*
+
+
+
+ authFilter
+ org.apache.hadoop.fs.http.server.HttpFSAuthenticationFilter
+
+
+
+ MDCFilter
+ org.apache.hadoop.lib.servlet.MDCFilter
+
+
+
+ hostnameFilter
+ org.apache.hadoop.lib.servlet.HostnameFilter
+
+
+
+ checkUploadContentType
+ org.apache.hadoop.fs.http.server.CheckUploadContentTypeFilter
+
+
+
+ fsReleaseFilter
+ org.apache.hadoop.fs.http.server.HttpFSReleaseFilter
+
+
+
+ authFilter
+ *
+
+
+
+ MDCFilter
+ *
+
+
+
+ hostnameFilter
+ *
+
+
+
+ checkUploadContentType
+ *
+
+
+
+ fsReleaseFilter
+ *
+
+
+
diff --git a/x86_64/share/hadoop/mapreduce/hadoop-mapreduce-client-app-2.2.0.jar b/x86_64/share/hadoop/mapreduce/hadoop-mapreduce-client-app-2.2.0.jar
new file mode 100644
index 0000000..faaf5f2
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/hadoop-mapreduce-client-app-2.2.0.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/hadoop-mapreduce-client-common-2.2.0.jar b/x86_64/share/hadoop/mapreduce/hadoop-mapreduce-client-common-2.2.0.jar
new file mode 100644
index 0000000..7c04737
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/hadoop-mapreduce-client-common-2.2.0.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar b/x86_64/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar
new file mode 100644
index 0000000..7a34374
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/hadoop-mapreduce-client-hs-2.2.0.jar b/x86_64/share/hadoop/mapreduce/hadoop-mapreduce-client-hs-2.2.0.jar
new file mode 100644
index 0000000..16f7650
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/hadoop-mapreduce-client-hs-2.2.0.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/hadoop-mapreduce-client-hs-plugins-2.2.0.jar b/x86_64/share/hadoop/mapreduce/hadoop-mapreduce-client-hs-plugins-2.2.0.jar
new file mode 100644
index 0000000..3d45a8b
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/hadoop-mapreduce-client-hs-plugins-2.2.0.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.2.0-tests.jar b/x86_64/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.2.0-tests.jar
new file mode 100644
index 0000000..8b8abab
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.2.0-tests.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.2.0.jar b/x86_64/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.2.0.jar
new file mode 100644
index 0000000..29a2fc8
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.2.0.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/hadoop-mapreduce-client-shuffle-2.2.0.jar b/x86_64/share/hadoop/mapreduce/hadoop-mapreduce-client-shuffle-2.2.0.jar
new file mode 100644
index 0000000..0a35bae
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/hadoop-mapreduce-client-shuffle-2.2.0.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar b/x86_64/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar
new file mode 100644
index 0000000..73ceee5
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/lib-examples/hsqldb-2.0.0.jar b/x86_64/share/hadoop/mapreduce/lib-examples/hsqldb-2.0.0.jar
new file mode 100644
index 0000000..2da29ed
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/lib-examples/hsqldb-2.0.0.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/lib/aopalliance-1.0.jar b/x86_64/share/hadoop/mapreduce/lib/aopalliance-1.0.jar
new file mode 100644
index 0000000..578b1a0
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/lib/aopalliance-1.0.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/lib/asm-3.2.jar b/x86_64/share/hadoop/mapreduce/lib/asm-3.2.jar
new file mode 100644
index 0000000..ca9f8d2
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/lib/asm-3.2.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/lib/avro-1.7.4.jar b/x86_64/share/hadoop/mapreduce/lib/avro-1.7.4.jar
new file mode 100644
index 0000000..69dd87d
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/lib/avro-1.7.4.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/lib/commons-compress-1.4.1.jar b/x86_64/share/hadoop/mapreduce/lib/commons-compress-1.4.1.jar
new file mode 100644
index 0000000..b58761e
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/lib/commons-compress-1.4.1.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/lib/commons-io-2.1.jar b/x86_64/share/hadoop/mapreduce/lib/commons-io-2.1.jar
new file mode 100644
index 0000000..b5c7d69
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/lib/commons-io-2.1.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/lib/guice-3.0.jar b/x86_64/share/hadoop/mapreduce/lib/guice-3.0.jar
new file mode 100644
index 0000000..f313e2b
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/lib/guice-3.0.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/lib/guice-servlet-3.0.jar b/x86_64/share/hadoop/mapreduce/lib/guice-servlet-3.0.jar
new file mode 100644
index 0000000..bdc6614
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/lib/guice-servlet-3.0.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/lib/hadoop-annotations-2.2.0.jar b/x86_64/share/hadoop/mapreduce/lib/hadoop-annotations-2.2.0.jar
new file mode 100644
index 0000000..6799af2
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/lib/hadoop-annotations-2.2.0.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/lib/hamcrest-core-1.1.jar b/x86_64/share/hadoop/mapreduce/lib/hamcrest-core-1.1.jar
new file mode 100644
index 0000000..e5149be
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/lib/hamcrest-core-1.1.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/lib/jackson-core-asl-1.8.8.jar b/x86_64/share/hadoop/mapreduce/lib/jackson-core-asl-1.8.8.jar
new file mode 100644
index 0000000..05f3353
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/lib/jackson-core-asl-1.8.8.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/lib/jackson-mapper-asl-1.8.8.jar b/x86_64/share/hadoop/mapreduce/lib/jackson-mapper-asl-1.8.8.jar
new file mode 100644
index 0000000..7c7cd21
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/lib/jackson-mapper-asl-1.8.8.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/lib/javax.inject-1.jar b/x86_64/share/hadoop/mapreduce/lib/javax.inject-1.jar
new file mode 100644
index 0000000..b2a9d0b
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/lib/javax.inject-1.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/lib/jersey-core-1.9.jar b/x86_64/share/hadoop/mapreduce/lib/jersey-core-1.9.jar
new file mode 100644
index 0000000..548dd88
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/lib/jersey-core-1.9.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/lib/jersey-guice-1.9.jar b/x86_64/share/hadoop/mapreduce/lib/jersey-guice-1.9.jar
new file mode 100644
index 0000000..cb46c94
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/lib/jersey-guice-1.9.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/lib/jersey-server-1.9.jar b/x86_64/share/hadoop/mapreduce/lib/jersey-server-1.9.jar
new file mode 100644
index 0000000..ae0117c
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/lib/jersey-server-1.9.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/lib/junit-4.10.jar b/x86_64/share/hadoop/mapreduce/lib/junit-4.10.jar
new file mode 100644
index 0000000..954851e
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/lib/junit-4.10.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/lib/log4j-1.2.17.jar b/x86_64/share/hadoop/mapreduce/lib/log4j-1.2.17.jar
new file mode 100644
index 0000000..1d425cf
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/lib/log4j-1.2.17.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/lib/netty-3.6.2.Final.jar b/x86_64/share/hadoop/mapreduce/lib/netty-3.6.2.Final.jar
new file mode 100644
index 0000000..a421e28
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/lib/netty-3.6.2.Final.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/lib/paranamer-2.3.jar b/x86_64/share/hadoop/mapreduce/lib/paranamer-2.3.jar
new file mode 100644
index 0000000..ad12ae9
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/lib/paranamer-2.3.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/lib/protobuf-java-2.5.0.jar b/x86_64/share/hadoop/mapreduce/lib/protobuf-java-2.5.0.jar
new file mode 100644
index 0000000..4c4e686
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/lib/protobuf-java-2.5.0.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/lib/snappy-java-1.0.4.1.jar b/x86_64/share/hadoop/mapreduce/lib/snappy-java-1.0.4.1.jar
new file mode 100644
index 0000000..8198919
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/lib/snappy-java-1.0.4.1.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/lib/xz-1.0.jar b/x86_64/share/hadoop/mapreduce/lib/xz-1.0.jar
new file mode 100644
index 0000000..a848f16
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/lib/xz-1.0.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-app-2.2.0-sources.jar b/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-app-2.2.0-sources.jar
new file mode 100644
index 0000000..20f1cc6
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-app-2.2.0-sources.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-app-2.2.0-test-sources.jar b/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-app-2.2.0-test-sources.jar
new file mode 100644
index 0000000..cfb8bbc
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-app-2.2.0-test-sources.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-common-2.2.0-sources.jar b/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-common-2.2.0-sources.jar
new file mode 100644
index 0000000..131281b
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-common-2.2.0-sources.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-common-2.2.0-test-sources.jar b/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-common-2.2.0-test-sources.jar
new file mode 100644
index 0000000..61f0750
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-common-2.2.0-test-sources.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-core-2.2.0-sources.jar b/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-core-2.2.0-sources.jar
new file mode 100644
index 0000000..2b555a0
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-core-2.2.0-sources.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-core-2.2.0-test-sources.jar b/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-core-2.2.0-test-sources.jar
new file mode 100644
index 0000000..b741688
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-core-2.2.0-test-sources.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-hs-2.2.0-sources.jar b/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-hs-2.2.0-sources.jar
new file mode 100644
index 0000000..bad59d0
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-hs-2.2.0-sources.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-hs-2.2.0-test-sources.jar b/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-hs-2.2.0-test-sources.jar
new file mode 100644
index 0000000..fc4dfb7
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-hs-2.2.0-test-sources.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-hs-plugins-2.2.0-sources.jar b/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-hs-plugins-2.2.0-sources.jar
new file mode 100644
index 0000000..736ed17
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-hs-plugins-2.2.0-sources.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-hs-plugins-2.2.0-test-sources.jar b/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-hs-plugins-2.2.0-test-sources.jar
new file mode 100644
index 0000000..e2bba40
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-hs-plugins-2.2.0-test-sources.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-jobclient-2.2.0-sources.jar b/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-jobclient-2.2.0-sources.jar
new file mode 100644
index 0000000..e327360
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-jobclient-2.2.0-sources.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-jobclient-2.2.0-test-sources.jar b/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-jobclient-2.2.0-test-sources.jar
new file mode 100644
index 0000000..fd9f0c1
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-jobclient-2.2.0-test-sources.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-shuffle-2.2.0-sources.jar b/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-shuffle-2.2.0-sources.jar
new file mode 100644
index 0000000..dcfe16f
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-shuffle-2.2.0-sources.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-shuffle-2.2.0-test-sources.jar b/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-shuffle-2.2.0-test-sources.jar
new file mode 100644
index 0000000..293e79f
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-client-shuffle-2.2.0-test-sources.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-2.2.0-sources.jar b/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-2.2.0-sources.jar
new file mode 100644
index 0000000..e79949c
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-2.2.0-sources.jar differ
diff --git a/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-2.2.0-test-sources.jar b/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-2.2.0-test-sources.jar
new file mode 100644
index 0000000..3bb5161
Binary files /dev/null and b/x86_64/share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-2.2.0-test-sources.jar differ
diff --git a/x86_64/share/hadoop/tools/lib/hadoop-archives-2.2.0.jar b/x86_64/share/hadoop/tools/lib/hadoop-archives-2.2.0.jar
new file mode 100644
index 0000000..6499c1f
Binary files /dev/null and b/x86_64/share/hadoop/tools/lib/hadoop-archives-2.2.0.jar differ
diff --git a/x86_64/share/hadoop/tools/lib/hadoop-datajoin-2.2.0.jar b/x86_64/share/hadoop/tools/lib/hadoop-datajoin-2.2.0.jar
new file mode 100644
index 0000000..e1e1d4e
Binary files /dev/null and b/x86_64/share/hadoop/tools/lib/hadoop-datajoin-2.2.0.jar differ
diff --git a/x86_64/share/hadoop/tools/lib/hadoop-distcp-2.2.0.jar b/x86_64/share/hadoop/tools/lib/hadoop-distcp-2.2.0.jar
new file mode 100644
index 0000000..72fdbf7
Binary files /dev/null and b/x86_64/share/hadoop/tools/lib/hadoop-distcp-2.2.0.jar differ
diff --git a/x86_64/share/hadoop/tools/lib/hadoop-extras-2.2.0.jar b/x86_64/share/hadoop/tools/lib/hadoop-extras-2.2.0.jar
new file mode 100644
index 0000000..c5abe95
Binary files /dev/null and b/x86_64/share/hadoop/tools/lib/hadoop-extras-2.2.0.jar differ
diff --git a/x86_64/share/hadoop/tools/lib/hadoop-gridmix-2.2.0.jar b/x86_64/share/hadoop/tools/lib/hadoop-gridmix-2.2.0.jar
new file mode 100644
index 0000000..0f25141
Binary files /dev/null and b/x86_64/share/hadoop/tools/lib/hadoop-gridmix-2.2.0.jar differ
diff --git a/x86_64/share/hadoop/tools/lib/hadoop-rumen-2.2.0.jar b/x86_64/share/hadoop/tools/lib/hadoop-rumen-2.2.0.jar
new file mode 100644
index 0000000..fffcd5e
Binary files /dev/null and b/x86_64/share/hadoop/tools/lib/hadoop-rumen-2.2.0.jar differ
diff --git a/x86_64/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar b/x86_64/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar
new file mode 100644
index 0000000..b86f7f6
Binary files /dev/null and b/x86_64/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar differ
diff --git a/x86_64/share/hadoop/tools/sources/hadoop-archives-2.2.0-sources.jar b/x86_64/share/hadoop/tools/sources/hadoop-archives-2.2.0-sources.jar
new file mode 100644
index 0000000..0d7d77b
Binary files /dev/null and b/x86_64/share/hadoop/tools/sources/hadoop-archives-2.2.0-sources.jar differ
diff --git a/x86_64/share/hadoop/tools/sources/hadoop-archives-2.2.0-test-sources.jar b/x86_64/share/hadoop/tools/sources/hadoop-archives-2.2.0-test-sources.jar
new file mode 100644
index 0000000..e5e3e68
Binary files /dev/null and b/x86_64/share/hadoop/tools/sources/hadoop-archives-2.2.0-test-sources.jar differ
diff --git a/x86_64/share/hadoop/tools/sources/hadoop-datajoin-2.2.0-sources.jar b/x86_64/share/hadoop/tools/sources/hadoop-datajoin-2.2.0-sources.jar
new file mode 100644
index 0000000..441add9
Binary files /dev/null and b/x86_64/share/hadoop/tools/sources/hadoop-datajoin-2.2.0-sources.jar differ
diff --git a/x86_64/share/hadoop/tools/sources/hadoop-datajoin-2.2.0-test-sources.jar b/x86_64/share/hadoop/tools/sources/hadoop-datajoin-2.2.0-test-sources.jar
new file mode 100644
index 0000000..dd46787
Binary files /dev/null and b/x86_64/share/hadoop/tools/sources/hadoop-datajoin-2.2.0-test-sources.jar differ
diff --git a/x86_64/share/hadoop/tools/sources/hadoop-distcp-2.2.0-sources.jar b/x86_64/share/hadoop/tools/sources/hadoop-distcp-2.2.0-sources.jar
new file mode 100644
index 0000000..8f8171e
Binary files /dev/null and b/x86_64/share/hadoop/tools/sources/hadoop-distcp-2.2.0-sources.jar differ
diff --git a/x86_64/share/hadoop/tools/sources/hadoop-distcp-2.2.0-test-sources.jar b/x86_64/share/hadoop/tools/sources/hadoop-distcp-2.2.0-test-sources.jar
new file mode 100644
index 0000000..4c00d6d
Binary files /dev/null and b/x86_64/share/hadoop/tools/sources/hadoop-distcp-2.2.0-test-sources.jar differ
diff --git a/x86_64/share/hadoop/tools/sources/hadoop-extras-2.2.0-sources.jar b/x86_64/share/hadoop/tools/sources/hadoop-extras-2.2.0-sources.jar
new file mode 100644
index 0000000..f804a27
Binary files /dev/null and b/x86_64/share/hadoop/tools/sources/hadoop-extras-2.2.0-sources.jar differ
diff --git a/x86_64/share/hadoop/tools/sources/hadoop-extras-2.2.0-test-sources.jar b/x86_64/share/hadoop/tools/sources/hadoop-extras-2.2.0-test-sources.jar
new file mode 100644
index 0000000..55e3d60
Binary files /dev/null and b/x86_64/share/hadoop/tools/sources/hadoop-extras-2.2.0-test-sources.jar differ
diff --git a/x86_64/share/hadoop/tools/sources/hadoop-gridmix-2.2.0-sources.jar b/x86_64/share/hadoop/tools/sources/hadoop-gridmix-2.2.0-sources.jar
new file mode 100644
index 0000000..fb9ca72
Binary files /dev/null and b/x86_64/share/hadoop/tools/sources/hadoop-gridmix-2.2.0-sources.jar differ
diff --git a/x86_64/share/hadoop/tools/sources/hadoop-gridmix-2.2.0-test-sources.jar b/x86_64/share/hadoop/tools/sources/hadoop-gridmix-2.2.0-test-sources.jar
new file mode 100644
index 0000000..ee9176e
Binary files /dev/null and b/x86_64/share/hadoop/tools/sources/hadoop-gridmix-2.2.0-test-sources.jar differ
diff --git a/x86_64/share/hadoop/tools/sources/hadoop-rumen-2.2.0-sources.jar b/x86_64/share/hadoop/tools/sources/hadoop-rumen-2.2.0-sources.jar
new file mode 100644
index 0000000..502b3b1
Binary files /dev/null and b/x86_64/share/hadoop/tools/sources/hadoop-rumen-2.2.0-sources.jar differ
diff --git a/x86_64/share/hadoop/tools/sources/hadoop-rumen-2.2.0-test-sources.jar b/x86_64/share/hadoop/tools/sources/hadoop-rumen-2.2.0-test-sources.jar
new file mode 100644
index 0000000..3aa2e95
Binary files /dev/null and b/x86_64/share/hadoop/tools/sources/hadoop-rumen-2.2.0-test-sources.jar differ
diff --git a/x86_64/share/hadoop/tools/sources/hadoop-streaming-2.2.0-sources.jar b/x86_64/share/hadoop/tools/sources/hadoop-streaming-2.2.0-sources.jar
new file mode 100644
index 0000000..8010536
Binary files /dev/null and b/x86_64/share/hadoop/tools/sources/hadoop-streaming-2.2.0-sources.jar differ
diff --git a/x86_64/share/hadoop/tools/sources/hadoop-streaming-2.2.0-test-sources.jar b/x86_64/share/hadoop/tools/sources/hadoop-streaming-2.2.0-test-sources.jar
new file mode 100644
index 0000000..6e6c2e7
Binary files /dev/null and b/x86_64/share/hadoop/tools/sources/hadoop-streaming-2.2.0-test-sources.jar differ
diff --git a/x86_64/share/hadoop/yarn/hadoop-yarn-api-2.2.0.jar b/x86_64/share/hadoop/yarn/hadoop-yarn-api-2.2.0.jar
new file mode 100644
index 0000000..c6d30b3
Binary files /dev/null and b/x86_64/share/hadoop/yarn/hadoop-yarn-api-2.2.0.jar differ
diff --git a/x86_64/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.2.0.jar b/x86_64/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.2.0.jar
new file mode 100644
index 0000000..51985dc
Binary files /dev/null and b/x86_64/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.2.0.jar differ
diff --git a/x86_64/share/hadoop/yarn/hadoop-yarn-applications-unmanaged-am-launcher-2.2.0.jar b/x86_64/share/hadoop/yarn/hadoop-yarn-applications-unmanaged-am-launcher-2.2.0.jar
new file mode 100644
index 0000000..f775354
Binary files /dev/null and b/x86_64/share/hadoop/yarn/hadoop-yarn-applications-unmanaged-am-launcher-2.2.0.jar differ
diff --git a/x86_64/share/hadoop/yarn/hadoop-yarn-client-2.2.0.jar b/x86_64/share/hadoop/yarn/hadoop-yarn-client-2.2.0.jar
new file mode 100644
index 0000000..0f501e1
Binary files /dev/null and b/x86_64/share/hadoop/yarn/hadoop-yarn-client-2.2.0.jar differ
diff --git a/x86_64/share/hadoop/yarn/hadoop-yarn-common-2.2.0.jar b/x86_64/share/hadoop/yarn/hadoop-yarn-common-2.2.0.jar
new file mode 100644
index 0000000..9f0f04a
Binary files /dev/null and b/x86_64/share/hadoop/yarn/hadoop-yarn-common-2.2.0.jar differ
diff --git a/x86_64/share/hadoop/yarn/hadoop-yarn-server-common-2.2.0.jar b/x86_64/share/hadoop/yarn/hadoop-yarn-server-common-2.2.0.jar
new file mode 100644
index 0000000..80ed4e3
Binary files /dev/null and b/x86_64/share/hadoop/yarn/hadoop-yarn-server-common-2.2.0.jar differ
diff --git a/x86_64/share/hadoop/yarn/hadoop-yarn-server-nodemanager-2.2.0.jar b/x86_64/share/hadoop/yarn/hadoop-yarn-server-nodemanager-2.2.0.jar
new file mode 100644
index 0000000..b0bd77d
Binary files /dev/null and b/x86_64/share/hadoop/yarn/hadoop-yarn-server-nodemanager-2.2.0.jar differ
diff --git a/x86_64/share/hadoop/yarn/hadoop-yarn-server-resourcemanager-2.2.0.jar b/x86_64/share/hadoop/yarn/hadoop-yarn-server-resourcemanager-2.2.0.jar
new file mode 100644
index 0000000..9700e54
Binary files /dev/null and b/x86_64/share/hadoop/yarn/hadoop-yarn-server-resourcemanager-2.2.0.jar differ
diff --git a/x86_64/share/hadoop/yarn/hadoop-yarn-server-tests-2.2.0.jar b/x86_64/share/hadoop/yarn/hadoop-yarn-server-tests-2.2.0.jar
new file mode 100644
index 0000000..c1f9a05
Binary files /dev/null and b/x86_64/share/hadoop/yarn/hadoop-yarn-server-tests-2.2.0.jar differ
diff --git a/x86_64/share/hadoop/yarn/hadoop-yarn-server-web-proxy-2.2.0.jar b/x86_64/share/hadoop/yarn/hadoop-yarn-server-web-proxy-2.2.0.jar
new file mode 100644
index 0000000..ef29ba1
Binary files /dev/null and b/x86_64/share/hadoop/yarn/hadoop-yarn-server-web-proxy-2.2.0.jar differ
diff --git a/x86_64/share/hadoop/yarn/hadoop-yarn-site-2.2.0.jar b/x86_64/share/hadoop/yarn/hadoop-yarn-site-2.2.0.jar
new file mode 100644
index 0000000..29fbf6e
Binary files /dev/null and b/x86_64/share/hadoop/yarn/hadoop-yarn-site-2.2.0.jar differ
diff --git a/x86_64/share/hadoop/yarn/lib-examples/hsqldb-2.0.0.jar b/x86_64/share/hadoop/yarn/lib-examples/hsqldb-2.0.0.jar
new file mode 100644
index 0000000..2da29ed
Binary files /dev/null and b/x86_64/share/hadoop/yarn/lib-examples/hsqldb-2.0.0.jar differ
diff --git a/x86_64/share/hadoop/yarn/lib/aopalliance-1.0.jar b/x86_64/share/hadoop/yarn/lib/aopalliance-1.0.jar
new file mode 100644
index 0000000..578b1a0
Binary files /dev/null and b/x86_64/share/hadoop/yarn/lib/aopalliance-1.0.jar differ
diff --git a/x86_64/share/hadoop/yarn/lib/asm-3.2.jar b/x86_64/share/hadoop/yarn/lib/asm-3.2.jar
new file mode 100644
index 0000000..ca9f8d2
Binary files /dev/null and b/x86_64/share/hadoop/yarn/lib/asm-3.2.jar differ
diff --git a/x86_64/share/hadoop/yarn/lib/avro-1.7.4.jar b/x86_64/share/hadoop/yarn/lib/avro-1.7.4.jar
new file mode 100644
index 0000000..69dd87d
Binary files /dev/null and b/x86_64/share/hadoop/yarn/lib/avro-1.7.4.jar differ
diff --git a/x86_64/share/hadoop/yarn/lib/commons-compress-1.4.1.jar b/x86_64/share/hadoop/yarn/lib/commons-compress-1.4.1.jar
new file mode 100644
index 0000000..b58761e
Binary files /dev/null and b/x86_64/share/hadoop/yarn/lib/commons-compress-1.4.1.jar differ
diff --git a/x86_64/share/hadoop/yarn/lib/commons-io-2.1.jar b/x86_64/share/hadoop/yarn/lib/commons-io-2.1.jar
new file mode 100644
index 0000000..b5c7d69
Binary files /dev/null and b/x86_64/share/hadoop/yarn/lib/commons-io-2.1.jar differ
diff --git a/x86_64/share/hadoop/yarn/lib/guice-3.0.jar b/x86_64/share/hadoop/yarn/lib/guice-3.0.jar
new file mode 100644
index 0000000..f313e2b
Binary files /dev/null and b/x86_64/share/hadoop/yarn/lib/guice-3.0.jar differ
diff --git a/x86_64/share/hadoop/yarn/lib/guice-servlet-3.0.jar b/x86_64/share/hadoop/yarn/lib/guice-servlet-3.0.jar
new file mode 100644
index 0000000..bdc6614
Binary files /dev/null and b/x86_64/share/hadoop/yarn/lib/guice-servlet-3.0.jar differ
diff --git a/x86_64/share/hadoop/yarn/lib/hadoop-annotations-2.2.0.jar b/x86_64/share/hadoop/yarn/lib/hadoop-annotations-2.2.0.jar
new file mode 100644
index 0000000..6799af2
Binary files /dev/null and b/x86_64/share/hadoop/yarn/lib/hadoop-annotations-2.2.0.jar differ
diff --git a/x86_64/share/hadoop/yarn/lib/hamcrest-core-1.1.jar b/x86_64/share/hadoop/yarn/lib/hamcrest-core-1.1.jar
new file mode 100644
index 0000000..e5149be
Binary files /dev/null and b/x86_64/share/hadoop/yarn/lib/hamcrest-core-1.1.jar differ
diff --git a/x86_64/share/hadoop/yarn/lib/jackson-core-asl-1.8.8.jar b/x86_64/share/hadoop/yarn/lib/jackson-core-asl-1.8.8.jar
new file mode 100644
index 0000000..05f3353
Binary files /dev/null and b/x86_64/share/hadoop/yarn/lib/jackson-core-asl-1.8.8.jar differ
diff --git a/x86_64/share/hadoop/yarn/lib/jackson-mapper-asl-1.8.8.jar b/x86_64/share/hadoop/yarn/lib/jackson-mapper-asl-1.8.8.jar
new file mode 100644
index 0000000..7c7cd21
Binary files /dev/null and b/x86_64/share/hadoop/yarn/lib/jackson-mapper-asl-1.8.8.jar differ
diff --git a/x86_64/share/hadoop/yarn/lib/javax.inject-1.jar b/x86_64/share/hadoop/yarn/lib/javax.inject-1.jar
new file mode 100644
index 0000000..b2a9d0b
Binary files /dev/null and b/x86_64/share/hadoop/yarn/lib/javax.inject-1.jar differ
diff --git a/x86_64/share/hadoop/yarn/lib/jersey-core-1.9.jar b/x86_64/share/hadoop/yarn/lib/jersey-core-1.9.jar
new file mode 100644
index 0000000..548dd88
Binary files /dev/null and b/x86_64/share/hadoop/yarn/lib/jersey-core-1.9.jar differ
diff --git a/x86_64/share/hadoop/yarn/lib/jersey-guice-1.9.jar b/x86_64/share/hadoop/yarn/lib/jersey-guice-1.9.jar
new file mode 100644
index 0000000..cb46c94
Binary files /dev/null and b/x86_64/share/hadoop/yarn/lib/jersey-guice-1.9.jar differ
diff --git a/x86_64/share/hadoop/yarn/lib/jersey-server-1.9.jar b/x86_64/share/hadoop/yarn/lib/jersey-server-1.9.jar
new file mode 100644
index 0000000..ae0117c
Binary files /dev/null and b/x86_64/share/hadoop/yarn/lib/jersey-server-1.9.jar differ
diff --git a/x86_64/share/hadoop/yarn/lib/junit-4.10.jar b/x86_64/share/hadoop/yarn/lib/junit-4.10.jar
new file mode 100644
index 0000000..954851e
Binary files /dev/null and b/x86_64/share/hadoop/yarn/lib/junit-4.10.jar differ
diff --git a/x86_64/share/hadoop/yarn/lib/log4j-1.2.17.jar b/x86_64/share/hadoop/yarn/lib/log4j-1.2.17.jar
new file mode 100644
index 0000000..1d425cf
Binary files /dev/null and b/x86_64/share/hadoop/yarn/lib/log4j-1.2.17.jar differ
diff --git a/x86_64/share/hadoop/yarn/lib/netty-3.6.2.Final.jar b/x86_64/share/hadoop/yarn/lib/netty-3.6.2.Final.jar
new file mode 100644
index 0000000..a421e28
Binary files /dev/null and b/x86_64/share/hadoop/yarn/lib/netty-3.6.2.Final.jar differ
diff --git a/x86_64/share/hadoop/yarn/lib/paranamer-2.3.jar b/x86_64/share/hadoop/yarn/lib/paranamer-2.3.jar
new file mode 100644
index 0000000..ad12ae9
Binary files /dev/null and b/x86_64/share/hadoop/yarn/lib/paranamer-2.3.jar differ
diff --git a/x86_64/share/hadoop/yarn/lib/protobuf-java-2.5.0.jar b/x86_64/share/hadoop/yarn/lib/protobuf-java-2.5.0.jar
new file mode 100644
index 0000000..4c4e686
Binary files /dev/null and b/x86_64/share/hadoop/yarn/lib/protobuf-java-2.5.0.jar differ
diff --git a/x86_64/share/hadoop/yarn/lib/snappy-java-1.0.4.1.jar b/x86_64/share/hadoop/yarn/lib/snappy-java-1.0.4.1.jar
new file mode 100644
index 0000000..8198919
Binary files /dev/null and b/x86_64/share/hadoop/yarn/lib/snappy-java-1.0.4.1.jar differ
diff --git a/x86_64/share/hadoop/yarn/lib/xz-1.0.jar b/x86_64/share/hadoop/yarn/lib/xz-1.0.jar
new file mode 100644
index 0000000..a848f16
Binary files /dev/null and b/x86_64/share/hadoop/yarn/lib/xz-1.0.jar differ
diff --git a/x86_64/share/hadoop/yarn/sources/hadoop-yarn-api-2.2.0-sources.jar b/x86_64/share/hadoop/yarn/sources/hadoop-yarn-api-2.2.0-sources.jar
new file mode 100644
index 0000000..cd57496
Binary files /dev/null and b/x86_64/share/hadoop/yarn/sources/hadoop-yarn-api-2.2.0-sources.jar differ
diff --git a/x86_64/share/hadoop/yarn/sources/hadoop-yarn-applications-distributedshell-2.2.0-sources.jar b/x86_64/share/hadoop/yarn/sources/hadoop-yarn-applications-distributedshell-2.2.0-sources.jar
new file mode 100644
index 0000000..318efd2
Binary files /dev/null and b/x86_64/share/hadoop/yarn/sources/hadoop-yarn-applications-distributedshell-2.2.0-sources.jar differ
diff --git a/x86_64/share/hadoop/yarn/sources/hadoop-yarn-applications-distributedshell-2.2.0-test-sources.jar b/x86_64/share/hadoop/yarn/sources/hadoop-yarn-applications-distributedshell-2.2.0-test-sources.jar
new file mode 100644
index 0000000..b4e679a
Binary files /dev/null and b/x86_64/share/hadoop/yarn/sources/hadoop-yarn-applications-distributedshell-2.2.0-test-sources.jar differ
diff --git a/x86_64/share/hadoop/yarn/sources/hadoop-yarn-applications-unmanaged-am-launcher-2.2.0-sources.jar b/x86_64/share/hadoop/yarn/sources/hadoop-yarn-applications-unmanaged-am-launcher-2.2.0-sources.jar
new file mode 100644
index 0000000..b92aa23
Binary files /dev/null and b/x86_64/share/hadoop/yarn/sources/hadoop-yarn-applications-unmanaged-am-launcher-2.2.0-sources.jar differ
diff --git a/x86_64/share/hadoop/yarn/sources/hadoop-yarn-applications-unmanaged-am-launcher-2.2.0-test-sources.jar b/x86_64/share/hadoop/yarn/sources/hadoop-yarn-applications-unmanaged-am-launcher-2.2.0-test-sources.jar
new file mode 100644
index 0000000..8fea641
Binary files /dev/null and b/x86_64/share/hadoop/yarn/sources/hadoop-yarn-applications-unmanaged-am-launcher-2.2.0-test-sources.jar differ
diff --git a/x86_64/share/hadoop/yarn/sources/hadoop-yarn-client-2.2.0-sources.jar b/x86_64/share/hadoop/yarn/sources/hadoop-yarn-client-2.2.0-sources.jar
new file mode 100644
index 0000000..4d96e85
Binary files /dev/null and b/x86_64/share/hadoop/yarn/sources/hadoop-yarn-client-2.2.0-sources.jar differ
diff --git a/x86_64/share/hadoop/yarn/sources/hadoop-yarn-client-2.2.0-test-sources.jar b/x86_64/share/hadoop/yarn/sources/hadoop-yarn-client-2.2.0-test-sources.jar
new file mode 100644
index 0000000..076e036
Binary files /dev/null and b/x86_64/share/hadoop/yarn/sources/hadoop-yarn-client-2.2.0-test-sources.jar differ
diff --git a/x86_64/share/hadoop/yarn/sources/hadoop-yarn-common-2.2.0-sources.jar b/x86_64/share/hadoop/yarn/sources/hadoop-yarn-common-2.2.0-sources.jar
new file mode 100644
index 0000000..6f2ccc6
Binary files /dev/null and b/x86_64/share/hadoop/yarn/sources/hadoop-yarn-common-2.2.0-sources.jar differ
diff --git a/x86_64/share/hadoop/yarn/sources/hadoop-yarn-common-2.2.0-test-sources.jar b/x86_64/share/hadoop/yarn/sources/hadoop-yarn-common-2.2.0-test-sources.jar
new file mode 100644
index 0000000..89a9975
Binary files /dev/null and b/x86_64/share/hadoop/yarn/sources/hadoop-yarn-common-2.2.0-test-sources.jar differ
diff --git a/x86_64/share/hadoop/yarn/sources/hadoop-yarn-server-common-2.2.0-sources.jar b/x86_64/share/hadoop/yarn/sources/hadoop-yarn-server-common-2.2.0-sources.jar
new file mode 100644
index 0000000..dee7617
Binary files /dev/null and b/x86_64/share/hadoop/yarn/sources/hadoop-yarn-server-common-2.2.0-sources.jar differ
diff --git a/x86_64/share/hadoop/yarn/sources/hadoop-yarn-server-common-2.2.0-test-sources.jar b/x86_64/share/hadoop/yarn/sources/hadoop-yarn-server-common-2.2.0-test-sources.jar
new file mode 100644
index 0000000..2888e50
Binary files /dev/null and b/x86_64/share/hadoop/yarn/sources/hadoop-yarn-server-common-2.2.0-test-sources.jar differ
diff --git a/x86_64/share/hadoop/yarn/sources/hadoop-yarn-server-nodemanager-2.2.0-sources.jar b/x86_64/share/hadoop/yarn/sources/hadoop-yarn-server-nodemanager-2.2.0-sources.jar
new file mode 100644
index 0000000..6a37e16
Binary files /dev/null and b/x86_64/share/hadoop/yarn/sources/hadoop-yarn-server-nodemanager-2.2.0-sources.jar differ
diff --git a/x86_64/share/hadoop/yarn/sources/hadoop-yarn-server-nodemanager-2.2.0-test-sources.jar b/x86_64/share/hadoop/yarn/sources/hadoop-yarn-server-nodemanager-2.2.0-test-sources.jar
new file mode 100644
index 0000000..e2f6024
Binary files /dev/null and b/x86_64/share/hadoop/yarn/sources/hadoop-yarn-server-nodemanager-2.2.0-test-sources.jar differ
diff --git a/x86_64/share/hadoop/yarn/sources/hadoop-yarn-server-resourcemanager-2.2.0-sources.jar b/x86_64/share/hadoop/yarn/sources/hadoop-yarn-server-resourcemanager-2.2.0-sources.jar
new file mode 100644
index 0000000..6e4bf28
Binary files /dev/null and b/x86_64/share/hadoop/yarn/sources/hadoop-yarn-server-resourcemanager-2.2.0-sources.jar differ
diff --git a/x86_64/share/hadoop/yarn/sources/hadoop-yarn-server-resourcemanager-2.2.0-test-sources.jar b/x86_64/share/hadoop/yarn/sources/hadoop-yarn-server-resourcemanager-2.2.0-test-sources.jar
new file mode 100644
index 0000000..641d8b8
Binary files /dev/null and b/x86_64/share/hadoop/yarn/sources/hadoop-yarn-server-resourcemanager-2.2.0-test-sources.jar differ
diff --git a/x86_64/share/hadoop/yarn/sources/hadoop-yarn-server-tests-2.2.0-test-sources.jar b/x86_64/share/hadoop/yarn/sources/hadoop-yarn-server-tests-2.2.0-test-sources.jar
new file mode 100644
index 0000000..6be4a99
Binary files /dev/null and b/x86_64/share/hadoop/yarn/sources/hadoop-yarn-server-tests-2.2.0-test-sources.jar differ
diff --git a/x86_64/share/hadoop/yarn/sources/hadoop-yarn-server-web-proxy-2.2.0-sources.jar b/x86_64/share/hadoop/yarn/sources/hadoop-yarn-server-web-proxy-2.2.0-sources.jar
new file mode 100644
index 0000000..323f94e
Binary files /dev/null and b/x86_64/share/hadoop/yarn/sources/hadoop-yarn-server-web-proxy-2.2.0-sources.jar differ
diff --git a/x86_64/share/hadoop/yarn/sources/hadoop-yarn-server-web-proxy-2.2.0-test-sources.jar b/x86_64/share/hadoop/yarn/sources/hadoop-yarn-server-web-proxy-2.2.0-test-sources.jar
new file mode 100644
index 0000000..c3b6a6e
Binary files /dev/null and b/x86_64/share/hadoop/yarn/sources/hadoop-yarn-server-web-proxy-2.2.0-test-sources.jar differ
diff --git a/x86_64/share/hadoop/yarn/test/hadoop-yarn-server-tests-2.2.0-tests.jar b/x86_64/share/hadoop/yarn/test/hadoop-yarn-server-tests-2.2.0-tests.jar
new file mode 100644
index 0000000..fea3750
Binary files /dev/null and b/x86_64/share/hadoop/yarn/test/hadoop-yarn-server-tests-2.2.0-tests.jar differ
--
cgit v1.2.3