<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>twodoteau</title>
	<atom:link href="http://ben.timby.com/?feed=rss2&#038;p=195" rel="self" type="application/rss+xml" />
	<link>http://ben.timby.com</link>
	<description>2.0, my second blog</description>
	<lastBuildDate>Wed, 28 Nov 2012 07:18:29 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.4.1</generator>
		<item>
		<title>Tracking a Python Memory Leak</title>
		<link>http://ben.timby.com/?p=225</link>
		<comments>http://ben.timby.com/?p=225#comments</comments>
		<pubDate>Wed, 28 Nov 2012 07:12:43 +0000</pubDate>
		<dc:creator>btimby</dc:creator>
				<category><![CDATA[linux]]></category>
		<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://ben.timby.com/?p=225</guid>
		<description><![CDATA[I never knew such a thing existed until recently. No garbage collector is perfect, and Python&#8217;s is no exception. It must be pretty damn good to have remained invisible to me for so long. The issue I ran into is well documented, namely, Objects can create a reference cycle that cannot be collected if two [...]]]></description>
			<content:encoded><![CDATA[<p>I never knew such a thing existed until recently. No garbage collector is perfect, and Python&#8217;s is no exception. It must be pretty damn good to have remained invisible to me for so long. The issue I ran into is <a href="http://mflerackers.wordpress.com/2012/04/12/fixing-and-avoiding-memory-leaks-in-python/">well documented</a>: objects can create a reference cycle that cannot be collected if one or more objects in the cycle have __del__() methods.</p>
<p>Specifically, I ran into this using the PyFilesystem library. This is not a failing of the library itself, but the library makes use of __del__() methods, and file systems (the major object type in this library) also reference one another. This makes it very easy to create a reference cycle as described above.</p>
<p>The solution is to be diligent about close()ing the file systems. Each fs has a close() method, but being naive to this problem, and used to the Python gc having my back, I was not diligent. I eventually discovered the problem because our FTP server was exhausting its memory. This problem never turned up in testing because, frankly, our testing focuses on correctness and does not generate a huge amount of load. By the time this problem made it to production, I had to troubleshoot it there, and fast.</p>
<p>I started by looking for advice from other pyfs users. I turned to the <a href="https://groups.google.com/forum/?fromgroups=#!topic/pyfilesystem-discussion/HhId2dcBUJA">mailing list</a> and received some great information from Ryan Kelly.</p>
<p>However, in the end, the solution was just good old-fashioned hard work. I produced a series of patches that corrected my misuse of the library. try&#8230;finally and with &#8230; as &#8230; solved all the simple issues, and some more sophisticated approaches solved the complex ones. Ultimately, ensuring the fs instance is closed when you are done with it makes it free all its references to other fs instances, breaking the cycle.</p>
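<p>To illustrate why closing helps, here is a minimal sketch. FakeFS is a hypothetical stand-in, not a PyFilesystem class; it only assumes that close() drops the references a file system holds to other file systems. With the cycle collector switched off, the objects are still freed by plain reference counting once close() has broken the cycle.</p>

```python
import gc
import weakref


class FakeFS(object):
    """Hypothetical stand-in for a file system that references another one."""

    def __init__(self, name):
        self.name = name
        self.other = None

    def close(self):
        # Drop references to other file systems so that no cycle survives.
        self.other = None

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()


def close_breaks_cycle():
    gc.disable()  # rule out the cycle collector; only reference counting remains
    try:
        with FakeFS("a") as a, FakeFS("b") as b:
            a.other, b.other = b, a  # a and b now reference each other
            probe = weakref.ref(a)
        # Leaving the with block called close() on both objects.
        del a, b
        return probe() is None  # True when the object was freed without the gc
    finally:
        gc.enable()
```

<p>The same guarantee can be had with try&#8230;finally when an object does not support the with statement.</p>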
<p>The only interesting part, and the reason for this post, is that I ended up writing a tool to help troubleshoot this problem. As I mentioned, it made it to production, which is not a good place for debuggers, especially in an FTP server under load. Therefore, I needed a tool that I could invoke against the running daemon, then perform offline analysis without impacting the server. When I looked, I did not find anything, so I created <a href="https://github.com/smartfile/caulk">caulk</a> (for plugging memory leaks).</p>
<p>The main feature of this library and CLI tool is that it makes it easy to set up a signal handler. Then you can `kill -USR1` your Python process and analyze the dump file using the caulk command. One late addition (added today) is integration with <a href="http://mg.pov.lt/objgraph/">objgraph</a> (if it is installed), which will also dump some graphs showing the references to the uncollectable object instances. You can view the graphs using the xdot command. This last part is what finally led me to the few remaining code paths that were not properly closing the fs.</p>
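<p>The general technique can be sketched with the standard library alone. This is not caulk&#8217;s API; the handler name, the dump path, and the JSON layout are all assumptions made for illustration. The handler writes a summary of live objects to a file when the process receives SIGUSR1, so the analysis can happen offline.</p>

```python
import collections
import gc
import json
import os
import signal
import tempfile

# Hypothetical dump location; a real tool would make this configurable.
DUMP_PATH = os.path.join(tempfile.gettempdir(), "objdump-%d.json" % os.getpid())


def dump_object_counts(signum, frame):
    """Write a count of live objects per type, plus the size of gc.garbage."""
    counts = collections.Counter(type(obj).__name__ for obj in gc.get_objects())
    summary = {
        "uncollectable": len(gc.garbage),
        "top_types": counts.most_common(20),
    }
    with open(DUMP_PATH, "w") as handle:
        json.dump(summary, handle)


# Install the handler once at daemon start-up (Unix only).
signal.signal(signal.SIGUSR1, dump_object_counts)
```

<p>With the handler installed, running `kill -USR1 PID` against the daemon produces the dump file without attaching a debugger.</p>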
]]></content:encoded>
			<wfw:commentRss>http://ben.timby.com/?feed=rss2&#038;p=225</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Vagrant + VirtualBox + VPN == DNS FAIL</title>
		<link>http://ben.timby.com/?p=200</link>
		<comments>http://ben.timby.com/?p=200#comments</comments>
		<pubDate>Thu, 24 May 2012 04:12:40 +0000</pubDate>
		<dc:creator>btimby</dc:creator>
				<category><![CDATA[linux]]></category>
		<category><![CDATA[networking]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[virtualization]]></category>
		<category><![CDATA[vpn]]></category>

		<guid isPermaLink="false">http://ben.timby.com/?p=200</guid>
		<description><![CDATA[I have been using Vagrant to manage development VMs for a few weeks now. It works great. However a recent change to my VPN client configuration caused me problems. I selected the VPN client option to use a remote DNS server. This allows me to access remote hosts by name instead of by IP address [...]]]></description>
			<content:encoded><![CDATA[<p>I have been using Vagrant to manage development VMs for a few weeks now. It works great. However, a recent change to my VPN client configuration caused me problems.</p>
<p>I selected the VPN client option to use a remote DNS server. This allows me to access remote hosts by name instead of by IP address or using an /etc/hosts file. It is convenient, but totally broke DNS resolution in my guest VMs.</p>
<p>I did not try to figure out the root cause, but whenever I am connected to the VPN, the guest cannot resolve any hosts. My guests use NAT networking in VirtualBox. The DHCP client on the guest is configuring /etc/resolv.conf with the IP address of my host machine&#8217;s interface on the NAT network.</p>
<p>The quick fix was to add PEERDNS=&#8221;no&#8221; to the guest OS /etc/sysconfig/network-scripts/ifcfg-eth0 file (the guest OS being CentOS), and then hard-code the DNS server address to something like 8.8.8.8 in /etc/resolv.conf, which should always work when connected to the Internet. I made this change in the VM that I created my basebox from and repackaged it, so now any Vagrant VMs derived from it have this change. At least Vagrant made it easy to propagate the change to the other VMs.</p>
<p>I am sure there is a bug somewhere in the DNS resolver or proxy that VirtualBox ships with that is the root cause, but I have been too busy to spend any time digging further.</p>
]]></content:encoded>
			<wfw:commentRss>http://ben.timby.com/?feed=rss2&#038;p=200</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Pyftpdlib + PyFilesystem</title>
		<link>http://ben.timby.com/?p=195</link>
		<comments>http://ben.timby.com/?p=195#comments</comments>
		<pubDate>Wed, 25 Apr 2012 15:06:17 +0000</pubDate>
		<dc:creator>btimby</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://ben.timby.com/?p=195</guid>
		<description><![CDATA[The title of this post mentions two great Python libraries which I use heavily. Actually it is disingenuous to call them libraries, because they are both in fact systems. I use these systems as part of the foundation of SmartFile, which is a Cloud storage provider with FTP, SFTP as well as HTTP and [...]]]></description>
			<content:encoded><![CDATA[<p>The title of this post mentions two great Python libraries which I use heavily. Actually it is disingenuous to call them libraries, because they are both in fact systems. I use these systems as part of the foundation of <a href="http://www.smartfile.com/">SmartFile</a>, which is a Cloud storage provider with FTP, SFTP as well as HTTP and API file access methods.</p>
<p>Besides my use of both projects, Pyftpdlib and PyFilesystem have another thing in common: great maintainers. I could not have pulled off everything I have with these two projects without the great oversight and direction of Giampaolo Rodola (pyftpdlib) and Will McGugan (pyfilesystem). So a big thanks to both of them is due for their excellent projects and leadership!</p>
<p><a href="http://code.google.com/p/pyftpdlib/">Pyftpdlib</a> consists of a library and tools for building FTP servers in Python. It is based on asyncore and as such provides a very efficient system which can scale to many thousands of simultaneous clients. It also ships with ready-made FTP servers that can simply be executed to start serving files over FTP. Pyftpdlib is <a href="http://code.google.com/p/pyftpdlib/wiki/Adoptions">used by many players</a> to bring FTP into myriad Python-based systems. The main strength of Pyftpdlib is that it provides an easy method to extend the base FTP server: one can plug in their own file access classes and authentication classes to customize the FTP server. Since the library is written in Python, even beyond the simple plugin system, one can customize the FTP server to any purpose. There are also a number of optional additions provided, such as SSL support and throttling, to name two.</p>
<p><a href="http://code.google.com/p/pyfilesystem/">PyFilesystem</a> is a library and set of tools that abstract storage interaction. The system consists of a common API that can be utilized for all file interactions, a number of backends which implement this API and store data in myriad systems, as well as a bunch of tools for interacting with the API. For example, using PyFilesystem, one can interact with Amazon&#8217;s S3, Riak, FTP, SFTP, WebDAV, and many other storage systems, all using the same API. Some backends even provide transformation of data; examples are encrypted or compressed file systems. Some file system backends exist purely for merging multiple other backends together. Examples of this are MultiFS, which can &#8220;stack&#8221; multiple file systems into one unified view, and MountFS, which can merge multiple file systems into a tree. Much of this functionality can be had using existing tools, but PyFilesystem allows greater flexibility in that all of this functionality is available within a Python application.</p>
<p>The way I use these two projects together should be obvious by now. But I have gone a step further and decided to contribute my code to the PyFilesystem project. <a href="http://code.google.com/p/pyfilesystem/source/browse/trunk/fs/expose/ftp.py">The code I contributed</a> is another in a growing list of &#8220;expose&#8221; wrappers that glue PyFilesystem into other Python projects (Paramiko (SFTP) being another notable one). This should allow others to easily create FTP servers to serve their favorite storage medium.</p>
]]></content:encoded>
			<wfw:commentRss>http://ben.timby.com/?feed=rss2&#038;p=195</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>HAProxy Logging Hostname to Syslog.</title>
		<link>http://ben.timby.com/?p=177</link>
		<comments>http://ben.timby.com/?p=177#comments</comments>
		<pubDate>Mon, 28 Nov 2011 22:02:23 +0000</pubDate>
		<dc:creator>btimby</dc:creator>
				<category><![CDATA[clustering]]></category>
		<category><![CDATA[linux]]></category>

		<guid isPermaLink="false">http://ben.timby.com/?p=177</guid>
		<description><![CDATA[By default HAProxy will not log the system hostname to syslog. For me this meant that syslog-ng was inserting &#8220;127.0.0.1&#8243; instead of the machine name. I use the machine name to filter my log entries to specific files. I found an undocumented feature in the latest HAProxy, log-send-hostname which changes this behaviour. http://haproxy.1wt.eu/git?p=haproxy.git;a=commitdiff;h=df5b38fac1788e6a134095459170a618a1c23388]]></description>
			<content:encoded><![CDATA[<p>By default HAProxy will not log the system hostname to syslog. For me this meant that syslog-ng was inserting &#8220;127.0.0.1&#8243; instead of the machine name. I use the machine name to filter my log entries to specific files.</p>
<p>I found an undocumented feature in the latest HAProxy, log-send-hostname which changes this behaviour.</p>
<p><a href="http://haproxy.1wt.eu/git?p=haproxy.git;a=commitdiff;h=df5b38fac1788e6a134095459170a618a1c23388">http://haproxy.1wt.eu/git?p=haproxy.git;a=commitdiff;h=df5b38fac1788e6a134095459170a618a1c23388</a></p>
]]></content:encoded>
			<wfw:commentRss>http://ben.timby.com/?feed=rss2&#038;p=177</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Unnecessary CentOS Services on a VM.</title>
		<link>http://ben.timby.com/?p=174</link>
		<comments>http://ben.timby.com/?p=174#comments</comments>
		<pubDate>Mon, 28 Nov 2011 21:59:55 +0000</pubDate>
		<dc:creator>btimby</dc:creator>
				<category><![CDATA[linux]]></category>
		<category><![CDATA[virtualization]]></category>
		<category><![CDATA[xen]]></category>

		<guid isPermaLink="false">http://ben.timby.com/?p=174</guid>
		<description><![CDATA[When running CentOS inside a VM, there are a number of services you get that are not needed. I found the following services that I ended up disabling. service smartd stop &#038;&#038; chkconfig smartd off service avahi-daemon stop &#038;&#038; chkconfig avahi-daemon off service cups stop &#038;&#038; chkconfig cups off service autofs stop &#038;&#038; chkconfig autofs [...]]]></description>
			<content:encoded><![CDATA[<p>When running CentOS inside a VM, there are a number of services you get that are not needed. I found the following services that I ended up disabling.</p>
<pre>service smartd stop &#038;&#038; chkconfig smartd off
service avahi-daemon stop &#038;&#038; chkconfig avahi-daemon off
service cups stop &#038;&#038; chkconfig cups off
service autofs stop &#038;&#038; chkconfig autofs off
service iscsid stop &#038;&#038; chkconfig iscsid off
service iscsi stop &#038;&#038; chkconfig iscsi off
service bluetooth stop &#038;&#038; chkconfig bluetooth off
service kudzu stop &#038;&#038; chkconfig kudzu off
service mdmonitor stop &#038;&#038; chkconfig mdmonitor off
service hidd stop &#038;&#038; chkconfig hidd off
service haldaemon stop &#038;&#038; chkconfig haldaemon off
service gpm stop &#038;&#038; chkconfig gpm off
service auditd stop &#038;&#038; chkconfig auditd off
service cpuspeed stop &#038;&#038; chkconfig cpuspeed off
service messagebus stop &#038;&#038; chkconfig messagebus off
service pcscd stop &#038;&#038; chkconfig pcscd off</pre>
<p>These do things like manage devices or file systems. Things a VM won&#8217;t need done. This stopped about 20 processes and freed up a few megabytes of memory. Every little bit helps when you run tens or hundreds of VMs.</p>
]]></content:encoded>
			<wfw:commentRss>http://ben.timby.com/?feed=rss2&#038;p=174</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Logging Apache to syslog.</title>
		<link>http://ben.timby.com/?p=171</link>
		<comments>http://ben.timby.com/?p=171#comments</comments>
		<pubDate>Mon, 28 Nov 2011 21:57:59 +0000</pubDate>
		<dc:creator>btimby</dc:creator>
				<category><![CDATA[linux]]></category>

		<guid isPermaLink="false">http://ben.timby.com/?p=171</guid>
		<description><![CDATA[Maybe everybody already knows this. I did not. Apache does not log to syslog. It logs to a file or a program. However, util-linux provides a program named logger that writes to syslog. It can be used directly with apache to proxy log entries to syslog. You just need to configure it via the CustomLog [...]]]></description>
			<content:encoded><![CDATA[<p>Maybe everybody already knows this. I did not. Apache does not log to syslog. It logs to a file or a program.</p>
<p>However, util-linux provides a program named logger that writes to syslog. It can be used directly with apache to proxy log entries to syslog. You just need to configure it via the CustomLog directive.</p>
<pre>CustomLog "|/usr/bin/logger -t 'apache'" combined</pre>
<p>I have also seen suggestions to use a dedicated socket to receive Apache log entries. I am not sure what the reason for this is; I would assume that logger uses internal syslog calls by default, which would perform better than a UNIX socket. However, if you want to test this method as well, you will need to add a new source to /etc/syslog-ng/syslog-ng.conf:</p>
<pre>source s_apache {
	unix-stream ("/dev/log_apache" max-connections(512) keep-alive(yes));
};</pre>
<p>Then the CustomLog directive in the httpd.conf changes slightly:</p>
<pre>CustomLog "|/usr/bin/logger -t 'apache' -u /dev/log_apache" combined</pre>
<p>Previously, I was using a python script to do the same job. I am glad to retire my python script :-).</p>
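<p>For comparison, a script doing that job can be quite small. The following is a sketch of the idea, not the script I retired; the function names, the &#8220;apache&#8221; tag, and the local0 facility are assumptions. It reads log lines from standard input, where Apache&#8217;s piped logging delivers them, and forwards each one to syslog.</p>

```python
import sys
import syslog


def forward(lines, emit):
    """Pass each non-empty line to emit(); return how many were forwarded."""
    count = 0
    for line in lines:
        line = line.rstrip("\n")
        if line:
            emit(line)
            count += 1
    return count


def main():
    # Tag and facility are illustrative choices, adjust to your syslog filters.
    syslog.openlog(ident="apache", facility=syslog.LOG_LOCAL0)
    forward(sys.stdin, lambda line: syslog.syslog(syslog.LOG_INFO, line))
```

<p>Calling main() at the bottom of the file and pointing CustomLog at the script with a leading pipe would wire it up, though logger does the same job with no code to maintain.</p>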
]]></content:encoded>
			<wfw:commentRss>http://ben.timby.com/?feed=rss2&#038;p=171</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>PDFremix</title>
		<link>http://ben.timby.com/?p=168</link>
		<comments>http://ben.timby.com/?p=168#comments</comments>
		<pubDate>Tue, 15 Nov 2011 22:25:13 +0000</pubDate>
		<dc:creator>btimby</dc:creator>
				<category><![CDATA[Django]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[web]]></category>

		<guid isPermaLink="false">http://ben.timby.com/?p=168</guid>
		<description><![CDATA[I launched a new free service today built with open source tools. It is a simple application that allows you to load multiple PDF files and then &#8220;remix&#8221; the pages from them in any order you wish. When you are done remixing, you can download a new PDF file that contains the pages you selected [...]]]></description>
			<content:encoded><![CDATA[<p>I launched a new free service today built with open source tools. It is a simple application that allows you to load multiple PDF files and then &#8220;remix&#8221; the pages from them in any order you wish.</p>
<p>When you are done remixing, you can download a new PDF file that contains the pages you selected in the order you defined.</p>
<p>This allows you to quickly combine a cover letter with another document, or remove some pages from a document before emailing it, etc.</p>
<p>The site is <a title="PDFremix -- Quickly reorder, add, or remove PDF pages." href="http://pdfremix.com">pdfremix.com</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://ben.timby.com/?feed=rss2&#038;p=168</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Building lftp.exe On Windows.</title>
		<link>http://ben.timby.com/?p=158</link>
		<comments>http://ben.timby.com/?p=158#comments</comments>
		<pubDate>Fri, 24 Jun 2011 18:12:35 +0000</pubDate>
		<dc:creator>btimby</dc:creator>
				<category><![CDATA[Windows]]></category>

		<guid isPermaLink="false">http://ben.timby.com/?p=158</guid>
		<description><![CDATA[An associate of mine needed to synchronize their Windows server backup directory to an off-site FTP server. The easiest way I know to do this is to use lftp. However, Windows is not officially supported by the lftp author. Therefore, you can either hunt down a version that someone compiled and bundled with the cygwin [...]]]></description>
			<content:encoded><![CDATA[<p>An associate of mine needed to synchronize their Windows server backup directory to an off-site FTP server. The easiest way I know to do this is to use lftp. However, Windows is not officially supported by the lftp author. Therefore, you can either hunt down a version that someone compiled and bundled with the cygwin .dll files, or you can install cygwin and its version of lftp.</p>
<p>My associate wanted to use the latest version of lftp, which is not available using either one of these methods. Therefore I built version 4.3.0 (latest as of now) for him on Cygwin. Below are the simple steps required to do so.</p>
<ol>
<li>Download cygwin&#8217;s <a href="http://www.cygwin.com/setup.exe">setup.exe</a>.</li>
<li>Install your toolchain: bison, autoconf, gcc, gcc-c++, make.</li>
<li>Install the provided version of lftp (4.2.3-1); this will pull in all the required dependencies.</li>
<li>Install GNUTLS: gnutls &amp; gnutls-devel.</li>
<li>You can now extract the lftp sources and build using:
<pre>$ ./configure &amp;&amp; make</pre>
</li>
<li>Once the build was successful, I simply determined the dependencies, then gathered the .exe file and all required .dll files into a folder and zipped them up. The following command will show all the non-Windows (cygwin) dependencies:
<pre>$ ldd lftp.exe | grep -v WINDOWS</pre>
</li>
</ol>
<h2>&#8211; OR &#8211;</h2>
<p>You can <a href="http://ben.timby.com/pub/lftp-4.3.0.zip">download</a> the version I built for him and save some time :-).</p>
]]></content:encoded>
			<wfw:commentRss>http://ben.timby.com/?feed=rss2&#038;p=158</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Apache mod_xsendfile and non-ASCII filenames.</title>
		<link>http://ben.timby.com/?p=149</link>
		<comments>http://ben.timby.com/?p=149#comments</comments>
		<pubDate>Mon, 14 Mar 2011 17:03:24 +0000</pubDate>
		<dc:creator>btimby</dc:creator>
				<category><![CDATA[Django]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[web]]></category>

		<guid isPermaLink="false">http://ben.timby.com/?p=149</guid>
		<description><![CDATA[If you are using a web development framework like Django, you know that it is not efficient at serving static files. However, if you need to protect access to the files to only specific users, and you want to use some code to do so, you need to find a solution that allows your Django [...]]]></description>
			<content:encoded><![CDATA[<p>If you are using a web development framework like Django, you know that it is not efficient at serving static files. However, if you need to protect access to the files to only specific users, and you want to use some code to do so, you need to find a solution that allows your Django application to do the authorization, while offloading the sending of the file to an external server.</p>
<p>If you are using Apache as a webserver, then one choice is to use mod_xsendfile, which allows you to set a header, instructing Apache to efficiently send the file to the client. This gives you the best of both worlds.</p>
<p>However, when dealing with an international application where filenames are determined by customers, you will often have file names that contain non-ASCII characters. In this case, Django will raise an exception when you set the X-SendFile header:</p>
<pre>UnicodeEncodeError: 'ascii' codec can't encode character u'\uf026' in position 74:
ordinal not in range(128), HTTP response headers must be in US-ASCII format</pre>
<p>This is because HTTP headers must be ASCII only. This problem is <a href="http://stackoverflow.com/questions/1156246/having-django-serve-downloadable-files">well</a> <a href="http://www.google.com/search?q=http+headers+non-ascii">documented</a>. So, what is the solution?</p>
<p>My solution was to patch mod_xsendfile so that it can accept an additional header, X-SendFile-Encoding, instructing it to decode the file name before use. In this manner, the Django application can encode the filename into ASCII, send it to the module, which will then decode it and send the file. The encoding scheme I selected is url encoding. Given that my file system encoding is UTF8, the full solution is:</p>
<pre>response['X-SendFile-Encoding'] = 'url'
response['X-SendFile'] = urllib.quote(path.encode('utf8'))</pre>
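<p>The snippet above is Python 2. On Python 3 the quoting function lives in urllib.parse. The sketch below shows the round trip on a sample path; the helper name and the path are made up for illustration. The encoded value is pure ASCII, so it is safe in a header, and unquoting recovers the original name.</p>

```python
from urllib.parse import quote, unquote


def encode_sendfile_path(path):
    """URL-encode the UTF-8 bytes of a path so the result is ASCII only."""
    return quote(path.encode("utf-8"))


original = u"/srv/files/r\u00e9sum\u00e9.pdf"  # hypothetical customer file name
encoded = encode_sendfile_path(original)        # '/srv/files/r%C3%A9sum%C3%A9.pdf'
decoded = unquote(encoded)                      # what the web server side recovers
```

<p>Note that quote() leaves the slash unescaped by default, so the directory structure of the path stays readable.</p>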
<p>The patch is available at: <a href="http://ben.timby.com/pub/mod_xsendfile-url_encoding.patch">http://ben.timby.com/pub/mod_xsendfile-url_encoding.patch</a><br />
I have also built an RPM for CentOS at: <a href="http://dagobah.ftphosting.net/yum/mod_xsendfile-0.11.1-5.x86_64.rpm">http://dagobah.ftphosting.net/yum/mod_xsendfile-0.11.1-5.x86_64.rpm</a><br />
And of course the SRPM: <a href="http://dagobah.ftphosting.net/yum/SRPMS/mod_xsendfile-0.11.1-5.x86_64.rpm">http://dagobah.ftphosting.net/yum/SRPMS/mod_xsendfile-0.11.1-5.x86_64.rpm</a></p>
<p>I hope this helps somebody!</p>
<p><strong>Update</strong></p>
<p>The author of mod_xsendfile implemented a better version of my patch. Now by default the header&#8217;s value will be decoded (causing no problems for non-encoded values). This behavior can be disabled using an optional configuration flag (XSendFileUnescape off).</p>
<p><a href="https://github.com/nmaier/mod_xsendfile/commit/b98d2d1df9f7acd720bf082e32b1392188a23379">https://github.com/nmaier/mod_xsendfile/commit/b98d2d1df9f7acd720bf082e32b1392188a23379</a></p>
<p><a href="https://github.com/nmaier/mod_xsendfile/commit/0efcd03ac196930da6b139b77972c0d430e0225c">https://github.com/nmaier/mod_xsendfile/commit/0efcd03ac196930da6b139b77972c0d430e0225c</a></p>
<p>Thank you Nils!</p>
]]></content:encoded>
			<wfw:commentRss>http://ben.timby.com/?feed=rss2&#038;p=149</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>An Efficient FIFO Buffer in Python.</title>
		<link>http://ben.timby.com/?p=139</link>
		<comments>http://ben.timby.com/?p=139#comments</comments>
		<pubDate>Fri, 14 Jan 2011 16:23:06 +0000</pubDate>
		<dc:creator>btimby</dc:creator>
				<category><![CDATA[linux]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://ben.timby.com/?p=139</guid>
		<description><![CDATA[The Problem. I am working on a client/server application that transfers huge amounts of data over a network. The server produces data and then it passes through a processing pipeline before being transmitted. The client receives the data and then pulls it through a similar processing pipeline before parsing and saving the result. The transport [...]]]></description>
			<content:encoded><![CDATA[<h2>The Problem.</h2>
<p>I am working on a client/server application that transfers huge amounts of data over a network. The server produces data and then it passes through a processing pipeline before being transmitted. The client receives the data and then pulls it through a similar processing pipeline before parsing and saving the result.</p>
<p>The transport is HTTP and the processing pipeline consists of encoders such as a chunked transfer encoder/decoder and a deflate/inflate filter. While building this project, I had need of an efficient FIFO buffer to be used by each of the links in the processing pipeline. For example, the deflate filter buffers data until it has received a defined &#8216;block size&#8217; of data before compressing the block and passing it down the chain. This is done for efficiency reasons: compressing 100 small buffers takes longer than compressing one large buffer. The chunked transfer decoder also needed a FIFO buffer, as it parses the data block by block. When reading from the previous link in the pipeline, sometimes the read would end in the middle of a structure. In that case, the read data must be buffered until the remainder of the structure is available upstream.</p>
<p>Because I am working with multi-gigabyte data streams, I can&#8217;t afford to buffer the entire thing, especially on the client. As data arrives, I want to feed it into one end of a FIFO buffer, and then consume from the other end. The obvious choice here is StringIO (cStringIO). However, it is not suitable for one simple reason: StringIO does not provide a means of truncating the BEGINNING of the buffer. You can write data to the tail and consume it from the head, but you cannot discard data after you read it.</p>
<p>The simple solution is to append data blocks to a list when writing, and pop them from the beginning when reading. This approach works, but is very slow. I present below the solution I am using now, which is about 10x faster. It works using the same idea, but the list contains StringIO instances rather than large data blocks in the form of strings.</p>
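<p>For reference, the simple list-based approach might look like the sketch below. The class name is made up and this is written for Python 3 bytes rather than copied from my project. Every read pops from the head of the list and may slice a block and push the remainder back, which is probably where much of the time goes.</p>

```python
class ListFifo(object):
    """Naive FIFO: one list entry per written block (illustrative only)."""

    def __init__(self):
        self.blocks = []

    def write(self, data):
        self.blocks.append(data)

    def read(self, length):
        parts = []
        remaining = length
        while self.blocks and remaining > 0:
            block = self.blocks.pop(0)  # shifts every other entry in the list
            if len(block) > remaining:
                # Put the unread tail back at the head, which copies the data.
                self.blocks.insert(0, block[remaining:])
                block = block[:remaining]
            parts.append(block)
            remaining -= len(block)
        return b"".join(parts)


fifo = ListFifo()
fifo.write(b"hello")
fifo.write(b"world")
first = fifo.read(7)    # spans both blocks
second = fifo.read(10)  # returns only what is left
```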
<h2>The Solution.</h2>
<p>The Buffer object starts off with an empty buffer list (self.buffers = []). When data arrives, the buffer will append a new StringIO instance to this list. As more data arrives, it is appended to this StringIO instance. Once the StringIO instance reaches a threshold, another StringIO instance is appended to the list and receives subsequent data. My testing uses a 4MB threshold for the StringIO size limit.</p>
<p>As data is read, it is read from the oldest StringIO instance. Once that instance is exhausted, it is removed from the list, freeing up the memory it consumed. Subsequent reads are then satisfied by the next oldest StringIO instance until all of them are exhausted. Once all StringIO instances are exhausted, the list is empty and the whole process starts again.</p>
<p>This provides for an elastic FIFO buffer that only keeps data resident until it is consumed. Once consumed, the data is eventually freed (in increments of 4MB). My implementation also uses a Lock to create a critical section when reading/writing the buffers. This is required in my project since the buffer is used to move data between threads (some processing pipeline elements have dedicated threads).</p>
<pre>import threading
try:
    from cStringIO import StringIO
except ImportError:
    from StringIO import StringIO

MAX_BUFFER = 1024**2*4

class Buffer(object):
    def __init__(self, max_size=MAX_BUFFER):
        self.buffers = []
        self.max_size = max_size
        self.lock = threading.Lock()
        self.closing = False
        self.eof = False
        self.read_pos = 0
        self.write_pos = 0

    def write(self, data):
        self.lock.acquire()
        try:
            if not self.buffers:
                self.buffers.append(StringIO())
                self.write_pos = 0
            buffer = self.buffers[-1]
            buffer.seek(self.write_pos)
            buffer.write(data)
            if buffer.tell() >= self.max_size:
                buffer = StringIO()
                self.buffers.append(buffer)
            self.write_pos = buffer.tell()
        finally:
            self.lock.release()

    def read(self, length=-1):
        self.lock.acquire()
        read_buf = StringIO()
        try:
            remaining = length
            while True:
                if not self.buffers:
                    break
                buffer = self.buffers[0]
                buffer.seek(self.read_pos)
                read_buf.write(buffer.read(remaining))
                self.read_pos = buffer.tell()
                if length == -1:
                    # we did not limit the read, we exhausted the buffer, so delete it.
                    # keep reading from remaining buffers.
                    del self.buffers[0]
                    self.read_pos = 0
                else:
                    #we limited the read so either we exhausted the buffer or not:
                    remaining = length - read_buf.tell()
                    if remaining > 0:
                        # exhausted, remove buffer, read more.
                        # keep reading from remaining buffers.
                        del self.buffers[0]
                        self.read_pos = 0
                    else:
                        # did not exhaust buffer, but read all that was requested.
                        # break to stop reading and return data of requested length.
                        break
        finally:
            self.lock.release()
        return read_buf.getvalue()

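    def __len__(self):
        # total number of unread bytes across all segments
        total = 0
        self.lock.acquire()
        try:
            for buffer in self.buffers:
                # seek to the end so tell() reports the segment size
                buffer.seek(0, 2)
                if buffer is self.buffers[0]:
                    # the oldest segment may already be partially consumed
                    total += buffer.tell() - self.read_pos
                else:
                    total += buffer.tell()
            return total
        finally:
            self.lock.release()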

    def close(self):
        self.eof = True</pre>
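<p>The listing above targets Python 2 (cStringIO). For Python 3 readers, here is a compact sketch of the same segmented idea. Note that it swaps StringIO for bytearray segments held in a deque, so it is an adaptation rather than a port, and the class name is made up.</p>

```python
import threading
from collections import deque


class SegmentedFifo(object):
    """Python 3 sketch: FIFO made of bounded bytearray segments."""

    def __init__(self, max_size=4 * 1024 ** 2):
        self.segments = deque()  # oldest segment on the left
        self.max_size = max_size
        self.read_pos = 0        # read offset into the oldest segment
        self.lock = threading.Lock()

    def write(self, data):
        with self.lock:
            # Start a new segment when there is none or the newest one is full.
            if not self.segments or len(self.segments[-1]) >= self.max_size:
                self.segments.append(bytearray())
            self.segments[-1] += data

    def read(self, length=-1):
        out = bytearray()
        with self.lock:
            while self.segments and (length == -1 or len(out) != length):
                seg = self.segments[0]
                if length == -1:
                    want = len(seg) - self.read_pos
                else:
                    want = length - len(out)
                chunk = seg[self.read_pos:self.read_pos + want]
                out += chunk
                self.read_pos += len(chunk)
                if self.read_pos == len(seg):
                    # Segment exhausted: drop it so its memory can be freed.
                    self.segments.popleft()
                    self.read_pos = 0
        return bytes(out)

    def __len__(self):
        with self.lock:
            return sum(len(seg) for seg in self.segments) - self.read_pos
```

<p>As in the original, a segment can grow past max_size on its last write; the threshold only decides when the next segment is started.</p>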
]]></content:encoded>
			<wfw:commentRss>http://ben.timby.com/?feed=rss2&#038;p=139</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>
