Thursday, December 27, 2012

JAVA IO - What is the best choice?

In general, we have two networking APIs:

java.net.Socket (via streams)

With the Socket class, writes use SocketOutputStream. The write() method ends up invoking a JNI method, socketWrite0(). This function has a local stack-allocated buffer of length MAX_BUFFER_LEN, which is set to 8192 in net_util_md.h. If the array fits in this buffer, it is copied using GetByteArrayRegion(). Finally, the implementation calls NET_Send, a wrapper around the send() system call. This means that every call to write a byte array in Java makes at least one copy that could be avoided in C. Even worse, if the Java byte array is longer than 8192 bytes, the code calls malloc() to allocate a buffer of up to 64 kB, then copies into that buffer.
Conclusion
Don't make calls to write() with arrays larger than 8 kB, since calling malloc() and free() for each write will impact performance. A chunked-write sketch follows.
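A minimal sketch of the idea in Java (ChunkedWriter and writeChunked are hypothetical names): split every large write into chunks of at most 8 kB, so the native socketWrite0() always fits its stack buffer and never falls back to malloc().

import java.io.IOException;
import java.io.OutputStream;

public class ChunkedWriter {
    private static final int MAX_CHUNK = 8192; // MAX_BUFFER_LEN from net_util_md.h

    public static void writeChunked(OutputStream out, byte[] data) throws IOException {
        int offset = 0;
        while (offset < data.length) {
            int len = Math.min(MAX_CHUNK, data.length - offset);
            out.write(data, offset, len); // each native write fits the 8 kB stack buffer
            offset += len;
        }
    }
}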


java.nio.SocketChannel


With the newer NIO package, writes must use ByteBuffers. When writing, the data first ends up at sun.nio.ch.SocketChannelImpl. It acquires some locks, then calls sun.nio.ch.IOUtil.write, which checks the type of the ByteBuffer. If it is a heap buffer, a temporary direct ByteBuffer is allocated from a pool and the data is copied using ByteBuffer.put(). The direct ByteBuffer is eventually written by calling sun.nio.ch.FileDispatcherImpl.write0, a JNI method. The Unix implementation finally calls write() with the raw address from the direct ByteBuffer.


Benchmark conclusion

  • OutputStream: when writing byte[] arrays larger than 8192 bytes, performance takes a hit. Read/write in chunks ≤ 8192 bytes.
  • ByteBuffer: direct ByteBuffers are faster than heap buffers for filling with bytes and integers. However, array copies are faster with heap ByteBuffers (results not shown here). Allocation and deallocation are apparently more expensive for direct ByteBuffers as well.
  • Little endian or big endian: doesn't matter for byte[], but little endian is faster for putting ints into ByteBuffers on a little-endian machine.
  • ByteBuffer versus byte[]: ByteBuffers are faster for I/O, but worse for filling with data.
Direct ByteBuffers provide very efficient I/O, but getting data into and out of them is more expensive than with byte[] arrays. Thus, the fastest choice is going to be application dependent.
If the buffer size is at least 2048 bytes, it is actually faster to fill a byte[] array, copy it into a direct ByteBuffer, and then write that, than to write the byte[] array directly. However, for small writes (512 bytes or less), writing the byte[] array using OutputStream is slightly faster.

Generally, using NIO can be a performance win, particularly for large writes. You want to allocate a single direct ByteBuffer and reuse it for all I/O to and from a particular channel. However, you should serialize and deserialize your data using byte[] arrays, since accessing individual elements of a ByteBuffer is slow. A sketch of this pattern follows.
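A minimal sketch of that recommendation (DirectBufferWriter and send are hypothetical names): serialize into a byte[], copy it into one reusable direct ByteBuffer, and write that buffer to the channel.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

public class DirectBufferWriter {
    // One direct buffer per channel, allocated once and reused for every write.
    private final ByteBuffer direct = ByteBuffer.allocateDirect(64 * 1024);

    public void send(SocketChannel channel, byte[] serialized) throws IOException {
        // Assumes serialized.length <= 64 kB; larger payloads would need chunking.
        direct.clear();
        direct.put(serialized);    // fill from a byte[] (fast element access)
        direct.flip();
        while (direct.hasRemaining()) {
            channel.write(direct); // I/O straight from the direct buffer, no extra copy
        }
    }
}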



Source



Tuesday, December 25, 2012

Scalar Profiling on Linux


Gathering information about a program's behavior, specifically its use of resources, is called profiling. The most important "resource" in terms of high performance computing is runtime. Hence, a common profiling strategy is to find out how much time is spent in the different functions, and maybe even lines, of a code in order to identify hot spots, i.e., the parts of the program that require the dominant fraction of runtime. These hot spots are then analyzed for possible optimization opportunities.

Function profiling - gprof

The most widely used profiling tool is gprof from the GNU binutils package. gprof uses both instrumentation and sampling to collect a flat function profile as well as a callgraph profile, also called a butterfly graph. In order to activate profiling, the code must be compiled with an appropriate option (many modern compilers can generate gprof-compliant instrumentation; for GCC, use -pg) and run once. This produces a non-human-readable file, gmon.out, to be interpreted by the gprof program. The flat profile contains information about the execution times of all the program's functions and how often they were called.
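A minimal usage sketch (myprog is a hypothetical program name), in the same spirit as the perf example below:

 gcc -pg -O2 -o myprog myprog.c
 ./myprog                              # running the instrumented binary writes gmon.out
 gprof myprog gmon.out > profile.txt   # flat profile plus callgraph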


Hardware performance counters - perf stat


Modern processors feature a small number of performance counters (often far fewer than ten), which are special on-chip registers that get incremented each time a certain event occurs. Among the usually several hundred events that can be monitored, a few are most useful for profiling:
  • Number of bus transactions, i.e., cache line transfers
  • Number of loads and stores
  • Number of floating-point operations
  • Mispredicted branches
  • Pipeline stalls
  • Number of instructions executed
On Linux we do this with perf stat. The perf tool supports a list of measurable events.
As an example:
 cpu-cycles OR cycles                       [Hardware event]
 instructions                               [Hardware event]
 cache-references                           [Hardware event]
 cache-misses                               [Hardware event]
 branch-instructions OR branches            [Hardware event]
 branch-misses                              [Hardware event]
 bus-cycles                                 [Hardware event]

 cpu-clock                                  [Software event]
 task-clock                                 [Software event]
 page-faults OR faults                      [Software event]
 minor-faults                               [Software event]
 major-faults                               [Software event]
 context-switches OR cs                     [Software event]
 cpu-migrations OR migrations               [Software event]
 alignment-faults                           [Software event]
 emulation-faults                           [Software event]

 L1-dcache-loads                            [Hardware cache event]
 L1-dcache-load-misses                      [Hardware cache event]
 L1-dcache-stores                           [Hardware cache event]
 L1-dcache-store-misses                     [Hardware cache event]
 L1-dcache-prefetches                       [Hardware cache event]
 L1-dcache-prefetch-misses                  [Hardware cache event]
 L1-icache-loads                            [Hardware cache event]
 L1-icache-load-misses                      [Hardware cache event]
 L1-icache-prefetches                       [Hardware cache event]
 L1-icache-prefetch-misses                  [Hardware cache event]
 LLC-loads                                  [Hardware cache event]
 LLC-load-misses                            [Hardware cache event]
 LLC-stores                                 [Hardware cache event]
 LLC-store-misses                           [Hardware cache event]

 LLC-prefetch-misses                        [Hardware cache event]
 dTLB-loads                                 [Hardware cache event]
 dTLB-load-misses                           [Hardware cache event]
 dTLB-stores                                [Hardware cache event]
 dTLB-store-misses                          [Hardware cache event]
 dTLB-prefetches                            [Hardware cache event]
 dTLB-prefetch-misses                       [Hardware cache event]
 iTLB-loads                                 [Hardware cache event]
 iTLB-load-misses                           [Hardware cache event]
 branch-loads                               [Hardware cache event]
 branch-load-misses                         [Hardware cache event]

 rNNN (see 'perf list --help' on how to encode it) [Raw hardware event descriptor]

 mem:<addr>[:access]                        [Hardware breakpoint]

 kvmmmu:kvm_mmu_pagetable_walk              [Tracepoint event]

 [...]

 sched:sched_stat_runtime                   [Tracepoint event]
 sched:sched_pi_setprio                     [Tracepoint event]
 syscalls:sys_enter_socket                  [Tracepoint event]
 syscalls:sys_exit_socket                   [Tracepoint event]
Example Run:
perf stat -B dd if=/dev/zero of=/dev/null count=1000000

1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB) copied, 0.956217 s, 535 MB/s

 Performance counter stats for 'dd if=/dev/zero of=/dev/null count=1000000':

            5,099 cache-misses             #      0.005 M/sec (scaled from 66.58%)
          235,384 cache-references         #      0.246 M/sec (scaled from 66.56%)
        9,281,660 branch-misses            #      3.858 %     (scaled from 33.50%)
      240,609,766 branches                 #    251.559 M/sec (scaled from 33.66%)
    1,403,561,257 instructions             #      0.679 IPC   (scaled from 50.23%)
    2,066,201,729 cycles                   #   2160.227 M/sec (scaled from 66.67%)
              217 page-faults              #      0.000 M/sec
                3 CPU-migrations           #      0.000 M/sec
               83 context-switches         #      0.000 M/sec
       956.474238 task-clock-msecs         #      0.999 CPUs

       0.957617512  seconds time elapsed

Links
AMD & Intel Processors specs
https://docs.google.com/open?id=0B83rvqbRt-ksMEl4U05Od1hUS2c
https://docs.google.com/open?id=0B83rvqbRt-ksR1VXaWMyTUdIQWs


Source

Sunday, December 23, 2012

Data Access optimization


Consider the typical latency and bandwidth numbers for data transfer to and from the different devices in a computer system. An overview of the data paths present in modern parallel computer systems, with typical ranges for their bandwidths and latencies, reveals a hierarchy: the functional units, which actually perform the computational work, sit at the top. In terms of bandwidth, the slowest data paths are three to four orders of magnitude away, and eight in terms of latency. The deeper a data transfer must reach down through the different levels in order to obtain required operands for some calculation, the harder the impact on performance. Any optimization attempt should therefore first aim at reducing traffic over slow data paths or, should this turn out to be infeasible, at least make data transfer as efficient as possible.

Optimization tips

Access memory in increasing address order.

 In particular:

  • scan arrays in increasing order;
  • scan multidimensional arrays using the rightmost index for innermost loops;
  • in class constructors and in assignment operators (operator=), access member variables in the order of declaration.
Data caches optimize memory accesses in increasing sequential order.
When a multidimensional array is scanned, the innermost loop should iterate over the last index, the innermost-but-one loop over the last-but-one index, and so on. In this way, array cells are guaranteed to be processed in the same order in which they are arranged in memory, as in the sketch below.
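A minimal C++ sketch of the loop-ordering rule (sum is a hypothetical function): the rightmost index varies in the innermost loop, so elements are visited in the order they sit in memory.

#include <cstddef>

// Sum a 2D array row by row: contiguous, cache-friendly accesses.
double sum(const double (&a)[1024][1024]) {
    double total = 0.0;
    for (std::size_t i = 0; i < 1024; ++i)       // leftmost index: outer loop
        for (std::size_t j = 0; j < 1024; ++j)   // rightmost index: inner loop
            total += a[i][j];                    // increasing-address scan
    return total;
}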

Memory alignment

Keep the compiler's default memory alignment.
By default, compilers use an alignment criterion for fundamental types, whereby objects may only have memory addresses that are a multiple of particular factors. This criterion guarantees top performance, but it may add padding (holes) between successive objects.
If it is necessary to avoid such padding for some structures, use the pragma directive only around those structure definitions, as in the sketch below.
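A hedged sketch (WireHeader is a hypothetical structure; assumes a compiler that supports #pragma pack, such as GCC or MSVC): the pragma disables padding only for the one structure that needs it, leaving the default alignment intact everywhere else.

#pragma pack(push, 1)   // suppress padding for this struct only
struct WireHeader {
    char type;          // 1 byte
    int  length;        // 4 bytes; would normally start at offset 4, here at offset 1
};
#pragma pack(pop)       // restore the compiler's default alignment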

Grouping functions in compilation units

Define in the same compilation unit all the member functions of a class, all the friend functions of that class, and all the member functions of friend classes of that class, except when the resulting file becomes unwieldy because of its size.
In this way, both the machine code resulting from the compilation of such functions and the static data defined in such classes and functions will have addresses near each other; in addition, even compilers that do not perform whole-program optimization may optimize the calls among these functions.

Grouping variables in compilation units

Define every global variable in the compilation unit in which it is used most often.
In this way, such variables will have addresses near to each other and to the static variables defined in those compilation units; in addition, even compilers that do not perform whole-program optimization may optimize the access to such variables from the functions that use them most often.

Private functions and variables in compilation units

Declare in an anonymous namespace the variables and functions that are global to a compilation unit but not used by other compilation units.
In C, and also in C++, such variables and functions may be declared static. However, in modern C++ the use of static global variables and functions is not recommended; they should be replaced by variables and functions declared in an anonymous namespace.
In both cases, the compiler is notified that such identifiers will never be used by other compilation units. This allows even compilers that do not perform whole-program optimization to optimize the usage of such variables and functions, as in the sketch below.
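A minimal C++ sketch (callCount, bump, and publicEntryPoint are hypothetical names): everything inside the anonymous namespace has internal linkage, so the compiler knows no other translation unit can touch it.

namespace {                      // anonymous namespace: internal linkage
    int callCount = 0;           // invisible to other compilation units
    void bump() { ++callCount; } // candidate for aggressive inlining
}

void publicEntryPoint() {        // the only externally visible symbol here
    bump();
}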




Source
Introduction to High Performance Computing for Scientists and Engineers,
Georg Hager, Gerhard Wellein

Saturday, December 22, 2012

Learn C++ Bundle

The C language was developed in 1972 by Dennis Ritchie at Bell Telephone Laboratories, primarily as a systems programming language. That is, a language to write operating systems with. Ritchie's primary goals were to produce a minimalistic language that was easy to compile, allowed efficient access to memory, produced efficient code, and did not need extensive run-time support. Despite being a fairly low-level high-level language, it was designed to encourage machine- and platform-independent programming.
C ended up being so efficient and flexible that in 1973, Ritchie and Ken Thompson rewrote most of the UNIX operating system using C. Many previous operating systems had been written in assembly. Unlike assembly, which ties a program to a specific CPU, C's excellent portability allowed UNIX to be recompiled on many different types of computers, speeding its adoption. C and UNIX had their fortunes tied together, and C's popularity was in part tied to the success of UNIX as an operating system.
In 1978, Brian Kernighan and Dennis Ritchie published a book called "The C Programming Language". This book, commonly known as K&R (after the authors' last names), provided an informal specification for the language and became a de facto standard. When maximum portability was needed, programmers would stick to the recommendations in K&R, because most compilers at the time were implemented to K&R standards.
In 1983, the American National Standards Institute (ANSI) formed a committee to establish a formal standard for C. In 1989 (committees take forever to do anything), they finished, and released the C89 standard, more commonly known as ANSI C. In 1990 the International Organization for Standardization adopted ANSI C (with a few minor modifications). This version of C became known as C90. Compilers eventually became ANSI C/C90 compliant, and programs desiring maximum portability were coded to this standard.
In 1999, the ANSI committee released a new version of C called C99. It adopted many features which had already made their way into compilers as extensions, or had been implemented in C++.
C++ (pronounced see plus plus) was developed by Bjarne Stroustrup at Bell Labs as an extension to C, starting in 1979. C++ was ratified in 1998 by the ISO committee, and again in 2003 (called C++03, which is what this tutorial will be teaching).
The underlying design philosophy of C and C++ can be summed up as “trust the programmer” — which is both wonderful, because the compiler will not stand in your way if you try to do something unorthodox that makes sense, but also dangerous, because the compiler will not stand in your way if you try to do something that could produce unexpected results. That is one of the primary reasons why knowing how NOT to code C/C++ is important — because there are quite a few pitfalls that new programmers are likely to fall into if caught unaware.
C++ adds many new features to the C language, and is perhaps best thought of as a superset of C, though this is not strictly true as C99 introduced a few features that do not exist in C++. C++’s claim to fame results primarily from the fact that it is an object-oriented language.

Book Links:

Thinking in C++ - part 1 - Bruce Eckel

https://docs.google.com/open?id=0B83rvqbRt-ksZl9pRXdWRXVkOFU

Thinking in C++ - part 2 - Bruce Eckel

https://docs.google.com/open?id=0B83rvqbRt-ksaW9OdUN4ZkRBR2c

Programming C++ - Bjarne Stroustrup

https://docs.google.com/open?id=0B83rvqbRt-ksemlkbTVfR2VUMXM

Programming C - Brian Kernighan and Dennis Ritchie

https://docs.google.com/open?id=0B83rvqbRt-ksQ0tRTkN2c3A5dnc

Why Linux is better than Windows ?


1. No viruses on Linux.
2. Linux is more stable.
3. Linux protects your device.
4. Linux is free.
5. You get only the programs and features you want.
6. Linux is a kind of freedom - a free license.
7. No drivers are needed.
8. No backdoors.
9. Linux doesn't get slower over time like Windows.
10. No need for super modern hardware.
AND MUCH MORE..........

Android is based on Linux, and Apple's OS X is built on top of a Unix core.
Everyone knows that Linux is more stable, more secure, and better.


In near future we will start learning Linux, so keep updated.

Source
Linux

Friday, December 21, 2012

Easy Big Data - Map Reduce - Inside Hadoop

What the hell is Map Reduce?

MapReduce is the concept that Hadoop is based on. And Hadoop is one of the most popular Big Data solutions, so... we have to know the basics in order to continue.
So let's start with a problem:
Problem: count the number of words in a paragraph.
As follows:


So the algorithm will look like this (a Java sketch follows the steps):
Read a word;
check whether the word is one of the stop words;
if not, look the word up in a HashMap, with the word as key and the number of occurrences as value.
If the word is not found in the HashMap, add it and set the value to 1.
If the word is found, increment the value and store it back in the HashMap.
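A minimal sketch of this serial version in Java (SerialWordCount is a hypothetical name; the stop-word check is omitted for brevity):

import java.util.HashMap;
import java.util.Map;

public class SerialWordCount {
    public static Map<String, Integer> count(String paragraph) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String word : paragraph.toLowerCase().split("\\s+")) {
            Integer current = counts.get(word);
            counts.put(word, current == null ? 1 : current + 1); // add or increment
        }
        return counts;
    }
}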

The algorithm is serial. If its input is a sentence, it works perfectly. But if its input is all of Wikipedia, it will run for a century!
So we probably need a different solution...

New solution:

Let's take the same problem and divide it into two steps. In the first step, we take each sentence and map the number of words in that sentence.


Once the words have been mapped, let's move to the next step. In this step, we combine (reduce) the maps from two sentences into a single map.
Sentences were mapped individually and, once mapped, were reduced to a single resulting map.

  • The whole process gets distributed into small tasks, which helps the job complete faster.
  • Both steps can be broken down into tasks. First, run multiple map tasks; once the mapping is done, run multiple reduce tasks to combine the results and finally aggregate them.

In other words, it is as if you wanted to run the work on separate threads and needed a solution for doing it without locks.
This way you have two separate tasks, map and reduce, and each of them can run totally independently. A classic Hadoop-style sketch of both tasks follows.
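A hedged sketch of the classic Hadoop word-count job (simplified; the job driver that wires the classes together is omitted), using the org.apache.hadoop.mapreduce API:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map task: emit (word, 1) for every word in an input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }
    // Reduce task: sum all the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}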

Adding HDFS

(For HDFS, read the HDFS post on Rami on the Web.)
Now imagine this MapReduce paradigm working on top of HDFS. HDFS has data nodes that split files into blocks and store them. If we run the map tasks on each of the data nodes, we can easily leverage the compute power of those data node machines.
So each of the data nodes can run tasks (map or reduce), which are the essence of MapReduce. As each data node stores data for multiple files, multiple tasks might be running at the same time for different data blocks.

Coming next - Hadoop - First steps

Source Hadoop

Me, Myself and My story

School

I was born in Vilnius, Lithuania, in 1981. My dad is a software developer, holds an MSc in Mathematics and Informatics, and worked in the IT department of a huge mall in Vilnius. My mom holds a BSc in Economics but never worked in the field, not even for a day. I went to the 55th school on Taykos street. When I was 12, my family immigrated to Israel and my new life began. I went to school and met new friends. In high school I took history and computer science as my intensified subjects; we did a lot of Pascal and Prolog, and I already knew about programming from my dad. In parallel, my favorites were the history of the ancient world and of the American nation.

Military

Then, at the age of 19, I went to the Israel Defense Forces. I served 3 years in the Communication Forces, where I handled different types of secured communication: voice, data, and networking on top of cellular, radio, and satellite channels, during military activities at sea, in the air, and on land. I also completed a team commanders' course with an excellence award, and in 2002 I received the president's excellence award.

Study 

In 2003, after my IDF service, I started studying Computer Science at the Holon Institute of Technology, and in 2006 I graduated. My favorite subjects were OOP programming (C++/C#), discrete mathematics, and algorithms. My specializations were computer architecture and artificial intelligence.
Final projects
AI: a unique algorithm for minimization of FSMs.
Computer architecture: a RISC processor based on an FPGA microchip with about 20 instructions.

Work
In my last semester (May 2006) I started working at Amdocs:
Position: Infrastructure engineer

Subjects  

Operating systems: HP-UX, Solaris, AIX, Linux, Windows
Scripting: tcsh, ksh, Tcl, Python, Jython, ANT
Infrastructure: J2EE
Application servers: WebSphere / WebLogic / JBoss / Tomcat / IHS
Programming languages: Java
Amdocs products: Clarify CRM, Enabler Billing
Deployments: AT&T - Lightspeed, Sprint
Middleware: IONA Artix, SonicMQ broker
DB: Oracle, MySQL
Technologies: JMS, EJB, MDB, Web services
Architecture solutions: high availability, scalability, geo-redundancy, in-memory caching, clustering

Then, in 2008, I switched position to:
Amdocs Billing R&D developer
Programming languages: Java, C/C++
IDEs: Eclipse / Visual Studio 2005/2010
Technologies: networking, multithreading, non-blocking technologies, in-memory solutions, event processing, soft real-time programming, framework architecture, optimizations, shared memory, serialization, inter-process communication and synchronization
Serialization: Hessian, JSON, binary
3rd parties: Apache MINA, JMS, JMX, EJB, ACE, Boost
Protocols: TCP, IPv4, IPv6, UDP, Diameter
End-to-end software development cycle: design, coding, testing, optimization, debugging, presentation.
I still work there today.

In parallel, I consult for startup companies, design solutions, and handle complex integration and optimization issues.

Courses:
SQL, scripting, EJB, Java programming, security for Web applications, ACE, design patterns, presentation skills, Spring, advanced C++, advanced Java, WebSphere administration.

Family
Married in March 2012, in NY, USA.

Hobbies:
Ancient history, music, movies, books.


My Resume:

Feel free to contact me on :
Phone :
From outside of Israel: +972 544793208
In Israel: 0544793208
email : rami.mankevich@gmail.com
facebook, linkedin