MessiandNeymar

  • Subscribe to our RSS feed.
  • Twitter
  • StumbleUpon
  • Reddit
  • Facebook
  • Digg

Friday, June 8, 2012

Your Friday afternoon reading list

Posted on 4:07 PM by Unknown

Here you go, just what you were waiting for :)

  1. A nice explanation of why it was rather challenging to build index scans into the Postgres MVCC engine: http://michael.otacoo.com/postgresql-2/postgresql-9-2-highlight-index-only-scans/
    The commit message talks about “visibility map”, which is a feature implemented since PostgreSQL 8.4, which allows to keep tracking of which pages contains only tuples that are visible to all the transactions (no data modified since latest vacuum cleanup for example). What this commit simply does is to check if the page that needs to be consulted is older than the transaction running.
  2. A simple introduction to Postgres's query timing features: http://momjian.us/main/blogs/pgblog/2012.html#June_8_2012
    Each data manipulation language (dml) command (select, insert, update, delete) goes through three stages:
    1. parser
    2. planner
    3. executor
    You can actually time how long each stage takes.
  3. A super-awesome discussion of the intricacies of Postgres's 9.2 Group Commit algorithm, and a change that is in the hopper for Postgres 9.3: http://pgeoghegan.blogspot.com/2012/06/towards-14000-write-transactions-on-my.html
    Oftentimes, they will find that this has happened, and will be able to simply fastpath out of the function that ensures that WAL is flushed (a call to that function is required to honour transactional semantics). In fact, it is expected that only a small minority of backends (one at a time, dubbed “the leader”) will actually ever go through with flushing WAL.
  4. A great essay on why your storage system really needs to be sensitive to whether it uses SSD hardware underneath: http://blog.empathybox.com/post/24415262152/ssds-and-distributed-data-systems
    Traditional B+Trees or hashes are no longer the most appropriate persistent data structure. This is not due to the drop in latency but due the the write endurance problem. Moving a database with a traditional storage engine to commodity SSDs will likely be quite fast but the SSDs may stop working after a few months!
  5. One of the best sharding presentations I've seen yet, from the Tumblr team: https://github.com/tumblr/jetpants/blob/master/doc/VelocityEurope2011Presentation.pdf?raw=true
    Sharding is the implementation of horizontal partitioning outside of MySQL (at the application level or service level). Each partition is a separate table. They may be located in different database schemas and/or different instances of MySQL.
  6. Also from the Tumblr gang, a nifty short note about using parallel gzip and netcat to get the fastest possible transfer of immense data sets between nodes: http://engineering.tumblr.com/post/7658008285/efficiently-copying-files-to-multiple-destinations
    By adding tee and a FIFO to the mix, you can create a fast copy chain: each node in the chain saves the data locally while simultaneously sending it to the next server in the chain.
  7. In the most recent issue of IEEE Computer, Prof. Eric Brewer revisits his famous CAP theorem, 12 years later: http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed
    The challenging case for designers is to mitigate a par­tition’s effects on consistency and availability. The key idea is to manage partitions very explicitly, including not only detection, but also a specific recovery process and a plan for all of the invariants that might be violated during a partition.
  8. Twitter have open-sourced their distributed trace infrastructure, designed for real-time tracing of service-oriented architectures under actual load: http://engineering.twitter.com/2012/06/distributed-systems-tracing-with-zipkin.html
    Zipkin started out as a project during our first Hack Week. During that week we implemented a basic version of the Google Dapper paper for Thrift. Today it has grown to include support for tracing Http, Thrift, Memcache, SQL and Redis requests.
    Here's Google Dapper, by the way, if you haven't seen it before: http://research.google.com/pubs/pub36356.html
  9. A great checklist if you find yourself considering the problem of, say, loading 500 million rows into your database, all at once: http://derwiki.tumblr.com/post/24490758395/loading-half-a-billion-rows-into-mysql
    Use LOAD DATA INFILE

    This is the most optimized path toward bulk loading structured data into MySQL. 8.2.2.1. Speed of INSERT Statements predicts a ~20x speedup over a bulk INSERT (i.e. an INSERT with thousands of rows in a single statement).

    LOAD DATA INFILE is the MySQL equivalent of Postgres's COPY FROM, I believe.
  10. A somewhat-marketing-slanted post about why you shouldn't expect any shared-disk database system to ever perform very well: http://database-scalability.blogspot.com/
    With a shared disk - there is a single shared copy of my big data on the shared disk, the database engine still have to maintain "buffer management, locking, thread locks/semaphores, and recovery tasks". But now - with a twist! Now all of the above need to be done "globally" between all participating servers, thru network adapters and cables, introducing latency. Every database server in the cluster needs to update all other nodes for every "buffer management, locking, thread locks/semaphores, and recovery tasks" it is doing on a block of data.
  11. A well-written plea to never poll, and never spin, and always use a real lock manager or real synchronization system instead: http://randomascii.wordpress.com/2012/06/05/in-praise-of-idleness/
    The comment block says “Note: these locks are for use when you aren’t likely to contend on the critical section”. Famous last words. Caveat lockor. Sorry Facebook, too risky, not recommended.
  12. A fascinating and detailed article about trying to use the Linux "perf" toolkit to profile and measure complicated system software (in this case, the Postgres engine) http://rhaas.blogspot.com/2012/06/perf-good-bad-ugly.html
    perf seems to be the best of the Linux profiling tools that are currently available by a considerable margin, and I think that they've made good decisions about the functionality and user interface. The lack of proper documentation is probably the tool's biggest weakness right now, but hopefully that is something that will be addressed over time.
  13. An intriguing introduction to some of the searching issues that come up in genome sequencing, which, at its core, bears a lot of resemblance to the problem of computing diffs between two versions of an object. http://williamedwardscoder.tumblr.com/post/24071805525/searching-for-substrings-in-a-massive-string
    How might you search a really big string - a few billion symbols - for the best alignment of a large number of short substrings - perhaps just a few hundred symbols long or less - with some allowed edit-distance fuzzy-matching?
    Includes some interesting pointers to the Bowtie project, which I had not previously encountered: http://bowtie-bio.sourceforge.net/index.shtml
  14. And, lastly, since no list should end with 13 items, from the University of Ulm, an entire PhD thesis entitled "Concurrent Programming for Scalable Web Architectures": http://berb.github.com/diploma-thesis/original/index.html

    Chapter 6 is, naturally, of particular interest to me, being a storage systems kinda guy:

    In this chapter, we consider the impact of concurrency and scalability to storage backends. We illustrate the challenge of guaranteeing consistency in distributed database systems and point to different consistency models. We then outline some internal concepts of distributed database systems and describe how they handle replication and partitioning.

Enjoy!

Email ThisBlogThis!Share to XShare to FacebookShare to Pinterest
Posted in | No comments
Newer Post Older Post Home

0 comments:

Post a Comment

Subscribe to: Post Comments (Atom)

Popular Posts

  • Shelter
    I meant to post this as part of my article on Watership Down , but then totally forgot: Shelter In Shelter you experience the wild as a moth...
  • The Legend of 1900: a very short review
    Fifteen years late, we stumbled across The Legend of 1900 . I suspect that 1900 is the sort of movie that many people despise, and a few peo...
  • Rediscovering Watership Down
    As a child, I was a precocious and voracious reader. In my early teens, ravenous and impatient, I raced through Richard Adams's Watershi...
  • Must be a heck of a rainstorm in Donetsk
    During today's Euro 2012 match between Ukraine and France, the game was suspended due to weather conditions, which is a quite rare occur...
  • Beethoven and Jonathan Biss
    I'm really enjoying the latest Coursera class that I'm taking: Exploring Beethoven’s Piano Sonatas . This course takes an inside-out...
  • Starting today, the games count
    In honor of the occasion: The Autumn Wind is a pirate, Blustering in from sea, With a rollocking song, he sweeps along, Swaggering boisterou...
  • Parbuckling
    The enormous project to right and remove the remains of the Costa Concordia is now well underway. There's some nice reporting on the NP...
  • For your weekend reading
    I don't want you to be bored this weekend, so I thought I'd pass along some articles you might find interesting. If not, hopefully y...
  • Are some algorithms simply too hard to implement correctly?
    I recently got around to reading a rather old paper: McKusick and Ganger: Soft Updates: A Technique for Eliminating Most Synchronous Writes ...
  • Don't see me!
    When she was young, and she had done something she was embarrassed by or felt guilty about, my daughter would sometimes hold up her hand to ...

Blog Archive

  • ►  2013 (165)
    • ►  September (14)
    • ►  August (19)
    • ►  July (16)
    • ►  June (17)
    • ►  May (17)
    • ►  April (18)
    • ►  March (24)
    • ►  February (19)
    • ►  January (21)
  • ▼  2012 (335)
    • ►  December (23)
    • ►  November (30)
    • ►  October (33)
    • ►  September (34)
    • ►  August (29)
    • ►  July (39)
    • ▼  June (27)
      • Offline for a while...
      • It's not just a movie...
      • Version Control for database stored procedures
      • 2012 USENIX Federated Conferences Week
      • That's a long one, all right!
      • Lots of interesting things to read about Flame
      • It's summer in California
      • Russia plummets to earth
      • Are the new TLDs a land grab?
      • Stuff I'm reading on a Friday afternoon
      • Must be a heck of a rainstorm in Donetsk
      • It's not just a game ...
      • Contented by life's subtleties
      • Some things take a long time; some things go very ...
      • Dig deep for knowledge
      • Unbroken: a very short review
      • Your Friday afternoon reading list
      • The extra cookie
      • Whales and Martingales
      • Wrapping up the Anand-Gelfand match
      • 72 hours until Euro 2012 begins!
      • Signed and unsigned integers in the C programming ...
      • Facebook Folly
      • Bryan is 50!
      • Trying to digest Oracle v. Google
      • Who says you can't learn anything useful on the In...
      • Sanger on Stuxnet in the NYT
    • ►  May (48)
    • ►  April (32)
    • ►  March (30)
    • ►  February (10)
Powered by Blogger.

About Me

Unknown
View my complete profile