Ruby and Apple

Last week the Ruby world was turned upside-down because of a security warning that Apple released about some Ruby security issue. It turns out that this is all wrong and not as bad as it seems. Sorry, but the Ruby guys at Apple are _total_ morons! And the Japanese, polite as they are, are just too kind! Thank you, Matz! Apple deserves a slap across the face for this one!

This is so classic!

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/17427

And from the Gentoo Bug List:

https://bugs.gentoo.org/show_bug.cgi?id=225465

Heap fragmentation in a long-running Ruby process


Abstract

In a long-running Ruby process with a highly dynamic object space, we encountered performance degradation and eventually memory-allocation failure due to heap fragmentation. The problem can be mitigated by linking Ruby against ptmalloc3.


Hi all! I’m writing this mail in the hope that my experiences may point you in the right direction, if you ever encounter a similar problem. Naturally I would be delighted to read your comments and advice on my conclusions and the steps taken.


http://ch.oddb.org [1] provides information on the Swiss health-care market. Behind an Apache/mod_ruby setup lies a single Ruby process, which acts as a DRb server. Predating Ruby on Rails, the application is based on home-grown libraries [2-4].
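For readers unfamiliar with that kind of setup, here is a minimal sketch of the pattern: a single long-running Ruby process exposes the application as a DRb front object, and the web layer (Apache/mod_ruby in ODDB's case) forwards each request to it as a method call. The URI, class and method names below are invented for illustration and are not taken from the ODDB codebase.

    require 'drb'

    # Hypothetical back-end object standing in for the real application server.
    class ExampleBackend
      def lookup(term)
        "results for #{term}"
      end
    end

    # The long-running server process that holds all application state.
    DRb.start_service('druby://localhost:9000', ExampleBackend.new)
    DRb.thread.join

A client (for example a mod_ruby handler) would then obtain a stub with DRbObject.new_with_uri('druby://localhost:9000') and call lookup on it as if it were a local object.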


A couple of weeks ago we experienced a spike in user requests. Although the application seemed to scale well most of the time, we began experiencing outages after a couple of hours. Whenever that happened, CPU load rose to 100% and DRb requests were hanging, sometimes for several minutes. At the same time, memory usage started rising considerably. If left running long enough, the application would crash with a NoMemoryError: 'Failed to allocate memory' – even though there was still plenty of memory available in the system.


Thanks to Jamis Buck [5] and Mauricio Fernandez [6] I was able to determine that the application was stuck for several seconds in glibc's realloc, which may be called (via ruby_xrealloc) from basically anywhere within Ruby where a new or enlarged chunk of memory might be required.


Having arrived at the diagnosis – heap fragmentation [7] – there were a couple of things I could try to improve the performance of our application, all revolving around the principle of creating fewer objects, and in particular fewer Strings, Arrays and Hashes. By eliminating a number of obvious suspects (mainly to do with the on-demand sorting of values stored in a large Hash), I was able to raise the life expectancy of our application considerably – close, but no cigar.
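To make the "fewer objects" principle concrete, here is a minimal sketch of the kind of change described above: caching a sorted snapshot instead of re-sorting a large Hash on every request, so far fewer intermediate Arrays get allocated. The SortedIndex class is invented for illustration; the actual ODDB code differs.

    # Hypothetical example: keep a sorted snapshot of a large Hash and
    # invalidate it only when the Hash changes, instead of allocating a
    # fresh sorted Array on every request.
    class SortedIndex
      def initialize
        @data = {}
        @sorted = nil
      end

      def []=(key, value)
        @data[key] = value
        @sorted = nil               # invalidate the cached sort
      end

      def sorted_values
        @sorted ||= @data.values.sort
      end
    end

Every call to sorted_values after the first returns the same cached Array until the next write, instead of building a new one.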


And then – all praise Bugzilla – I found a bug report [8] describing almost exactly my problem and leading me to ptmalloc3 [9]. Glibc's malloc implementation is based on ptmalloc2, and it may be replaced by simply linking Ruby against ptmalloc3.


As far as I understand, ptmalloc3 does not eliminate heap fragmentation. However, thanks to the bitwise tree employed in the newer version, it finds free chunks of the right size orders of magnitude faster. Additionally, it seems that glibc 2.5 abandons its attempts to find a best-fit chunk after a while (possibly after 10000 tries), instead expanding the heap as long as possible and finally failing to allocate memory – causing first the fast rise in memory usage and later the observed NoMemoryError.


At this time, http://ch.oddb.org has run – powered by Ruby and ptmalloc3 – for a little more than 24 hours without displaying any of the signs I have come to associate with heap fragmentation. Significantly less time is spent allocating memory – and consequently in GC – and the overall memory footprint has decreased by about 30%.


I hope this is of use – thanks in advance for any thoughts you want to share.

Hannes Wyss

[1] Open Drug Database
http://scm.ywesee.com/?p=oddb.org;a=summary
[2] Object-Database Access and Object Cache
http://scm.ywesee.com/?p=odba;a=summary
[3] State-Based Session Management
http://scm.ywesee.com/?p=sbsm;a=summary
[4] Component-based HTML generator
http://scm.ywesee.com/?p=htmlgrid;a=summary
[5] Inspecting a live ruby process, Jamis Buck
http://weblog.jamisbuck.org/2006/9/22/inspecting-a-live-ruby-process
[6] Ruby live process introspection, Mauricio Fernandez
http://eigenclass.org/hiki.rb?ruby+live+process+introspection
[7] Heap fragmentation, Bruno R. Preiss
http://www.brpreiss.com/books/opus8/html/page425.html
[8] Glibc bugzilla report 4349, Mingzhou Sun, Tomash Brechko
http://sourceware.org/bugzilla/show_bug.cgi?id=4349
[9] Ptmalloc home, Wolfram Gloger
http://www.malloc.de/en/

ODBA Improvements

From Hannes Wyss's mail to the ywesee-internal list:

The current ODBA commit contains several improvements and two bugfixes which, taken together, should keep long-term memory usage under control. Final confirmation will only come from continuous online operation.

  • The cleaner thread in ODBA.cache now runs at a higher (normal) priority and more frequently, but for shorter stretches. Concretely, within a window of roughly 10 seconds, 500 objects are checked and, where appropriate, removed from the cache (see the sketch after this list).
  • Bugfix: when collection elements are loaded individually from the DB, they are now also registered in the cache.
  • CacheEntry keeps track of which objects access a given other object. This bookkeeping is no longer done via direct references but via odba_id/object_id – which guarantees that it can no longer stand in the way of the GC. Until now the references were always removed in time and 'should' not have mattered anyway; now there is a guarantee.
  • Also removed from CacheEntry is the @collection entry. When saving a collection it is necessary to know which of its elements are already in the database, which must be deleted and which are new. Until now this was handled with exactly that @collection entry; now the existing data is fetched directly from the DB.
  • Bugfix: in one particular constellation, printing an error message caused a memory spike. (Concretely: Narcotic#to_s depends on the substances in Narcotic@substances. For some instances of Narcotic the corresponding object had been deleted or never saved. This was supposed to be recorded with an error message; but because the error message included ODBA::Stub@container.to_s, the result was an endless loop of exceptions.) This bug ate up half of the three days. In the end I was only able to find it thanks to this tool.
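As a rough illustration of the first bullet, here is a minimal sketch of a batched cache cleaner: a background thread at normal priority that checks a small number of entries per pass instead of sweeping everything at once. ExampleCache and its method names are invented; the real ODBA.cache works differently in detail.

    require 'thread'

    # Hypothetical stand-in for ODBA.cache: entries carry a last-access time
    # and are retired in small batches so no single sweep blocks the server.
    class ExampleCache
      Entry = Struct.new(:object, :last_access)

      def initialize(retire_after = 300)
        @entries = {}
        @mutex = Mutex.new
        @retire_after = retire_after
      end

      def store(key, object)
        @mutex.synchronize { @entries[key] = Entry.new(object, Time.now) }
      end

      def fetch(key)
        @mutex.synchronize do
          entry = @entries[key]
          entry.last_access = Time.now if entry
          entry && entry.object
        end
      end

      # Check up to batch_size entries and drop those not touched recently.
      def clean_batch(batch_size = 500)
        @mutex.synchronize do
          @entries.keys.first(batch_size).each do |key|
            entry = @entries[key]
            @entries.delete(key) if Time.now - entry.last_access > @retire_after
          end
        end
      end
    end

    cache = ExampleCache.new

    cleaner = Thread.new do
      loop do
        cache.clean_batch(500)   # ~500 objects per pass...
        sleep 10                 # ...roughly every 10 seconds
      end
    end
    cleaner.priority = 0         # normal priority, as described above

In a long-running DRb server such a thread simply lives for the lifetime of the process.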

Observations about ODBA, Ruby 1.8.6 and ODDB.org

1. I'm following the memory consumption of our application ODDB.org, Ruby 1.8.6 and our open-source RAM-cache-management software ODBA. When we have over 300 open sessions it seems that the GC takes far longer than 20 secs to do its job. At around 390 simultaneously open sessions my "Fasterfox" measured way over 60 secs until the application 'returned' to the user in a responsive way (a small timing sketch follows after this list).

2. Somehow I also have the feeling that when we use more memory the GC runs faster. But that is just an assumption.

3. You can watch our memory consumption here.

4. One thing I can also say for sure is that Ruby 1.8.6 can deal with a lot more traffic. Somehow the GC has really been optimized.

5. We now know that we have some duplicated objects that we do not want. The objects have been identified, but we do not yet know where they come from.
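To put rough numbers on observation 1 from inside the process, one can time a forced GC run; a minimal sketch using only Ruby's standard Benchmark library, with no ODBA specifics assumed:

    require 'benchmark'

    # Time a full garbage-collection run and report the current object count.
    # In a live server this could be logged periodically from a background thread.
    elapsed = Benchmark.realtime { GC.start }
    objects = ObjectSpace.each_object {}   # each_object returns the number of objects walked
    puts "GC.start took %.2f secs with %d live objects" % [elapsed, objects]

Comparing such numbers before and after a change (or before and after a traffic spike) is a crude but cheap way to watch GC pauses grow.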

Some more observations on Ruby 1.8.6 and ODDB.org

OK, since we installed Ruby 1.8.6 the GC (garbage collection) no longer takes 50 secs (or more) to do its job when our application is around 2 GB big. The time is down to about 20 secs – and – the speed at which the queries are delivered is up, up, up! Thank you for fixing this, dear Ruby community.


Update: I must actually elaborate a bit. The GC used to force us to do a restart because it took such a long time to do its job. With Ruby 1.8.6 the memory usage still “grows” throughout the day. This just has less impact on our Service as we have 12 GB of memory on your server. We still want to find out what increases the memory consumption of our software, though. We owe that answer to Ryan Davis.