
Since i'm on vacation next week, i thought i'd tackle something light this week.

Last time i talked about GetScatterGatherList and PutScatterGatherList and how much better they are than the older method of doing DMA.  But as much as I like these two functions, they have one major problem that hit us while we were working on the storage stack - they allocate memory.

During the development of Windows XP one of our goals in the storage team was to ensure we could successfully issue disk I/O operations even if pool was exhausted or the system address space was full.  In Windows even kernel memory can be paged out, and when you can't page in a kernel thread's stack the system can't really continue.  The kernel can crash if a page-in fails due to a bad block on disk, or due to some driver returning STATUS_INSUFFICIENT_RESOURCES.

AllocateAdapterChannel & MapTransfer would work for us, but the performance isn't good on modern systems since you can't issue more than one request at a time through the channel.  We needed something new.

The trick to making forward progress even when you can't freely allocate memory is to preallocate all the resources you need for at least one I/O operation for use in emergencies.  When you get an I/O you try to allocate what you need, and if you can't get it you try to use the "emergency" resources.  If those aren't available you queue the request for later processing when the emergency resources are free.

In order to do this around DMA, we needed the ability to pre-allocate a scatter-gather list, then hand that to the DMA engine to fill in.  This is exactly what BuildScatterGatherList does - it constructs the SG list within the supplied buffer but otherwise acts just like GetScatterGatherList.

There's only one problem.  GetScatterGatherList doesn't just allocate space for your scatter gather list.  It also allocates private memory so that it can track the DMA mapping operation - list entries for enqueuing it, the map register base, the number of map registers - all of those things you would normally have to keep track of yourself.  Obviously BuildScatterGatherList can't allocate memory, and your driver shouldn't have to guess how much extra space it might need.  So how do you know how big to make the buffer you hand in?

You find that out by calling CalculateScatterGatherList().  It takes a CurrentVa and a Length along with an optional MDL, and determines the size of buffer that BuildScatterGatherList requires.  If you provide an MDL the function will compute the required size for any chained MDLs as well.  If you provide NULL for the MDL then it uses CurrentVa and Length to figure out how many pages you're transferring and derives the rest from there.

With these two functions you can ensure that you'll always have enough memory to handle one DMA mapping for some reasonable sized I/O operation (where that reasonable size is whatever you passed in for Length).
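Strung together, the reserve pattern looks roughly like this pseudocode-style sketch.  DevExt, MaxTransferLength, ClaimEmergencyBuffer and QueueForLater are made-up names, and error handling and locking are elided - this shows the shape of the idea, not a compilable driver:

```c
/* Sketch only - not a compilable driver. */

/* At init time: ask how big the SG buffer must be for the largest
 * transfer you promise to make progress on, and preallocate it. */
ULONG sgListSize;
dmaAdapter->DmaOperations->CalculateScatterGatherList(
    dmaAdapter, NULL, baseVa, MaxTransferLength, &sgListSize, NULL);
DevExt->EmergencySgBuffer = ExAllocatePoolWithTag(
    NonPagedPool, sgListSize, 'amDe');

/* Per-I/O: try the allocating path first... */
status = dmaAdapter->DmaOperations->GetScatterGatherList(
    dmaAdapter, DeviceObject, Mdl, CurrentVa, Length,
    MyAdapterListControl, Irp, WriteToDevice);

/* ...and fall back to the preallocated buffer when pool is gone. */
if (!NT_SUCCESS(status) && ClaimEmergencyBuffer(DevExt)) {
    status = dmaAdapter->DmaOperations->BuildScatterGatherList(
        dmaAdapter, DeviceObject, Mdl, CurrentVa, Length,
        MyAdapterListControl, Irp, WriteToDevice,
        DevExt->EmergencySgBuffer, sgListSize);
}

/* Reserve already in use: queue the request until it frees up. */
if (!NT_SUCCESS(status))
    QueueForLater(DevExt, Irp);
```

The one rule that makes this work is that only one request at a time may own the emergency buffer, and whoever frees it restarts the queue.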

I was playing around with the CSS override stuff today.  I hope i haven't blinded anyone.

Yesterday i talked about how to do your DMA operations the old way.  And it's painful.  Very, very painful.  Fortunately for those of you with bus-mastering controllers there's a much easier way.


AllocateAdapterChannel suffers from some real problems.  Because it isn't allowed to fail when the system is low on memory, it can't allocate memory, and so it can't track more than one request at a time.  It was designed to manage slave-mode devices and that makes it clumsy for more modern controllers.  Fortunately there's an alternative - GetScatterGatherList (and its cousin BuildScatterGatherList, which i'll discuss tomorrow).

GetScatterGatherList replaces both the AllocateAdapterChannel and MapTransfer methods.  It allocates the scatter-gather list for you and fills it in so your AdapterListControl routine (equivalent of the ExecutionRoutine) only has to start the transfer.  When you're done you call PutScatterGatherList to flush the buffers and free the map registers.

Additionally, since GetScatterGatherList does allocate memory, it can keep track of more than one mapping request at a time.  This means you don't need to serialize your calls to it (though you still need to call it at DISPATCH_LEVEL).

Finally GetScatterGatherList will handle chained MDLs as long as the sum total of map registers required by all the MDLs doesn't exceed the number you allocated.  This is another terrific simplification.

So what's the downside of calling GetScatterGatherList?  There are two - one small and one big.  The small disadvantage is that it does a bunch of work up front to map the transfer.  If you absolutely must map in small chunks to get your transfer started faster then AllocateAdapterChannel+MapTransfer is the better choice.  Still, this isn't a big deal.

The much bigger problem is that GetScatterGatherList allocates memory, which means that it can fail if resources are low.  If your driver is trying to ensure forward progress in low memory conditions (meaning you don't just fail when you can't allocate pool) then GetScatterGatherList is going to be a problem.  Fortunately there's an alternative - BuildScatterGatherList - which i'll talk about next time.

So to summarize - forget everything i said last time about AllocateAdapterChannel & MapTransfer.  It's far too complicated and provides little benefit.  [Get|Build]ScatterGatherList is a much better option.

-p

 

What Is DMA (Part 6) - Scatter Gather DMA the "old" way

To be honest, it has been a long, long time since i've needed to support slave-mode DMA or packet-based non-scatter-gather DMA. To talk about those i'd probably have to do some (gulp) research. Also I'm not sure how much they apply to modern hardware. It seems pretty cheap these days to buy a DMA controller that can handle scatter-gather for your device. So I'll start there.


Way back when, in Windows NT 3.51 and 4.0 (i've never had to support anything earlier), there was one set of DDIs to do all your DMA functions. The sequence of operations went something like this:

  • Push your I/O through a device queue or the start-IO queue

    The old DMA DDIs can only handle processing one request at a time for any given adapter object.  The resource used to keep track of the map register allocation is stored in the adapter object and there's only one of them.

    The window for serialization is between your call to AllocateAdapterChannel and your call to FreeMapRegisters.  Once you've called FreeMapRegisters you can invoke AllocateAdapterChannel again (which you may do indirectly by calling IoStartNextPacket which runs your StartIO routine, which calls AllocateAdapterChannel again).

  • Call KeFlushIoBuffers to flush any cached data:

    This may not seem necessary on a platform which is cache coherent with respect to DMA, but there are still good reasons to call it.  First, how would your driver know that it's on such a platform (we'll assume that the system does something special with uncached common-buffer)?  Second, pushing data out of the processor caches will help avoid stalls in your DMA operations.  Finally - if it's not really necessary then it's very likely a no-op, so just call it.

  • Call AllocateAdapterChannel to request map registers:

    This method can be found on your DMA_ADAPTER object. You give this the number of map registers that you require and an ExecutionRoutine + Context. When the number of map registers you requested are available, the DMA API will call your ExecutionRoutine and provide the specified context. Your execution routine will prepare the buffers and start the DMA transfer. Generally you would pass in your IRP as the context, but it could be any data structure that can tell your execution routine what to do.

    How many map registers should you ask for? Remember that each map register allows you to transfer one page of data at a time. You can determine how many pages you need with some annoying math, or you can simply use the ADDRESS_AND_SIZE_TO_SPAN_PAGES macro. This takes an address and a length and determines how many physical pages that buffer spans (a two byte buffer at address 0x8000ffff would span two pages). If you were working with a chained MDL, you would need to call this macro for each MDL in the chain and add the page counts together.

    You might think you could map less than the full transfer if you're only transferring say one page at a time.  But in asking the DMA folks about this, it becomes very very complicated to keep track of the map registers.  So just map the whole thing at once and get it over with.  You'll have a simpler driver and you'll be happier.

  • Your ExecutionRoutine sets up the transfer:

    When your ExecutionRoutine is invoked its job is to get logical addresses for your buffer and to start the DMA transfer on your device.  You might be able to do this all at once, or you might need to "stage" the transfer and do it in page-sized chunks.  Either way the steps are more or less the same.

    You'll need to save the map register base in your device extension, since you'll need it when the transfer is done.  If you are going to transfer in chunks then you will also want to save the list of logical addresses & lengths that you get from mapping the buffer (we'll call this your scatter gather list), an index or pointer into that list so you know where you left off, and the number of bytes you have left to transfer.  For your scatter gather list you'll need one entry (PHYSICAL_ADDRESS and length) for each map register you were granted.

    Next you loop through the buffer, calling MapTransfer to turn each physical fragment into a logical address & length.  Since you set up the DMA_ADAPTER with ScatterGather set to TRUE, MapTransfer will return logical address ranges in small chunks.  You'll save these in the scatter gather list.

    As you iterate through the buffer you need to track your "CurrentVa".  The CurrentVa is the offset into the buffer plus the MDL's starting virtual address - that is, (offset + MmGetMdlVirtualAddress(Mdl)).  It's not a straight offset and this can be a royal pain.  In the past i've stored an "offset" in my device extension and then i do this math each time to compute the CurrentVa.

    You do not need to worry about the map register base - the DMA DDI keeps track of which map registers you've used as you map.

    Once you've mapped enough you can program your device to start the DMA transfer.  Use the logical addresses you received to set up the transfer, start the device running, and return from your ExecutionRoutine (i'll explain the possible return values below).  When your device is done you'll notice somehow (probably an interrupt) and can schedule a DPC to either start the next stage, or to free resources and start the next request.

    You must not block in your ExecutionRoutine.  The device should be able to run the DMA transfer on its own and notify your driver when it's complete.

    When your ExecutionRoutine is ready to return you have three options for return status.  KeepObject is only for slave-mode DMA where you need to keep the actual DMA controller allocated for you.  For a bus-master you can either return DeallocateObject or DeallocateObjectKeepRegisters.  Since you can't block in your ExecutionRoutine you would only return the first value if you were aborting your DMA transfer and no longer needed the map registers.  Otherwise keep them until you're done with the transfer.

  • Handle the next stage (optional and may repeat)

    If you are staging the transfer then you'll want to start the next stage when the current one completes.  Use the values you saved in the execution routine (scatter gather list, offset into the scatter-gather list and number of bytes left to transfer) to start the next segment.

    Eventually you hit the last stage (if you did everything in one shot this is also the first stage :) ), or you decide the transfer has failed.  Either way that takes us to the last steps.

  • Undo the DMA mappings

    When you're done with your transfer you need to undo the mappings you created above.  To do this you call FlushAdapterBuffers, providing it with the map register base, the MDL, the CurrentVa to start at and the number of bytes transferred.

    You should only call FlushAdapterBuffers once at the end of the transfer.  If you've read the list of operations above, you might realize that you could call MapTransfer before each stage rather than doing it all at once.  That's true, but you should still flush once at the end.  Otherwise the DMA DDI can get confused.

  • Free the map registers

    Now that you're completely done, you call FreeMapRegisters to release the map registers that were allocated.  Note that you only need to do this if your ExecutionRoutine returned DeallocateObjectKeepRegisters.

  • Start the next request

    And now you're free to start the next operation.  This might be as simple as calling IoStartNextPacket, or it may be more complicated.
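As a rough, pseudocode-style summary of the whole sequence (made-up helper names like SaveSgEntry and ProgramDeviceAndGo, no error handling, not compilable as-is):

```c
/* StartIo: one request at a time through the adapter object. */
KeFlushIoBuffers(Mdl, ReadOperation, TRUE /* DmaOperation */);
nMapRegisters = ADDRESS_AND_SIZE_TO_SPAN_PAGES(
                    MmGetMdlVirtualAddress(Mdl), Length);
IoAllocateAdapterChannel(AdapterObject, DeviceObject,
                         nMapRegisters, MyExecutionRoutine, Irp);

/* ExecutionRoutine: map each fragment, then start the device. */
IO_ALLOCATION_ACTION MyExecutionRoutine(...)
{
    DevExt->MapRegisterBase = MapRegisterBase;  /* save for later  */
    currentVa = MmGetMdlVirtualAddress(Mdl);
    while (bytesLeft) {
        length = bytesLeft;
        logical = MapTransfer(AdapterObject, Mdl, MapRegisterBase,
                              currentVa, &length, WriteToDevice);
        SaveSgEntry(DevExt, logical, length);   /* your SG list    */
        currentVa += length;
        bytesLeft -= length;
    }
    ProgramDeviceAndGo(DevExt);
    return DeallocateObjectKeepRegisters;       /* bus master      */
}

/* DPC, once the device interrupts to say it's done: */
FlushAdapterBuffers(AdapterObject, Mdl, DevExt->MapRegisterBase,
                    MmGetMdlVirtualAddress(Mdl), bytesTransferred,
                    WriteToDevice);
FreeMapRegisters(AdapterObject, DevExt->MapRegisterBase,
                 nMapRegisters);
IoStartNextPacket(DeviceObject, FALSE);
```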

Simple, huh?  Okay - it's a big pain in the butt.  I hate the old DMA DDIs and i try never to use them. 

GetScatterGatherList and PutScatterGatherList do most of this work for you, and allow you to run more than one command at a time, so it's a much better starting point.  I'll probably talk about them next time.

Let me say that again - DON'T USE THESE DDIS UNLESS YOU CAN'T AVOID IT.  They're much too complicated for general use, and the last thing you want when you're writing a device driver is more complexity.

-p


Packet Based DMA

Last time i talked about using common buffer to stage your DMA operations.  Doing this allows you to coalesce very fragmented packets, which can be very valuable, but it does complicate your DMA operations.  After all someone has to manage the common buffer that you allocated.

The alternative is known as Packet Based DMA in the DDK.  In Packet Based DMA you ask Windows to prepare each DMA "packet" for a transfer to or from your device.  You provide the buffer you want to map and a callback "ExecutionRoutine".  Windows will look at how many map registers the transfer requires and how many you have available.  If there aren't enough Windows will queue your request until they free up.

Once there are enough free registers, Windows will invoke your "ExecutionRoutine".  This is your driver's cue to start the DMA transfer.  When you are done with the transfer you call the DMA DDI again to return all of the map registers to the pool.  At this point Windows may invoke the ExecutionRoutine for a subsequent request.

So what is a "packet" then?  It's any unit of transfer that is:

  • Uni-directional (Packet based DDI doesn't support bi-directional transfers)
  • Fits within the number of map registers you were granted when you created the DMA_ADAPTER (NumberOfMapRegisters * PAGE_SIZE)
  • Represents a reasonable unit of transfer for your device

I realize the last one is pretty vague.  Unfortunately i can't answer what is a good unit for a given device.  For most SCSI controllers, the unit of transfer is a single SCSI operation.  For a network controller a unit of transfer might be a single packet, or might be a whole sequence of packets.

The DMA DDI has two pairs of functions for this allocate/release pairing.  The older set of DDIs are:

  • AllocateAdapterChannel
  • MapTransfer
  • FlushAdapterBuffers
  • FreeMapRegisters (or FreeAdapterChannel for slave-mode devices)

The newer set of DDIs are:

  • GetScatterGatherList (or BuildScatterGatherList)
  • PutScatterGatherList

Each set has its own strengths and weaknesses.  The first set works with slave-mode DMA as well as bus-mastering DMA.  However the second set is much simpler to use, as it generates an entire scatter-gather list for you automatically.

I'll talk more about the two options next time.

-p

 

The DMA API also allows you to create a section of kernel memory which you can share between your driver and your device.  This memory is known as "common buffer", and has a variety of uses with modern PCI devices.  You can allocate a piece of common buffer by calling the AllocateCommonBuffer function in your DMA_ADAPTER object.  This function takes a length and returns the virtual and logical address of your new buffer.

Common Buffer has four unique attributes that make it useful:

  1. The buffer is physically contiguous.
  2. The buffer is created in a physical address range that your device can access.
  3. Changes your driver makes to the common buffer are visible by your device and vice versa.
  4. You don't need an available map register to make use of it.

The first two attributes cannot be reproduced with any other WDM DDI.  MmAllocateContiguousMemory is the closest competitor, but because it's not tied into the HAL it can't determine what the correct range of physical addresses is for your device.  The second two are what make this really useful as a shared buffer.

The biggest downsides of common buffer are that it can't be allocated at DISPATCH_LEVEL, that it's hard to get because physical memory fragments quickly, and that it can be a scarce resource so you don't want to allocate huge amounts of it.  Because of the first two issues you'll probably want to allocate a slab of common buffer during device initialization and then sub-allocate blocks out of that for the various operations.  This can be simple if you can break the common buffer into fixed size blocks (you could then stick them on a lookaside list) or you may find yourself writing your own malloc/free functions.
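The fixed-size case is simple enough to sketch.  Here's a minimal user-mode model of that sub-allocator, with a static array standing in for the slab of common buffer you'd grab from AllocateCommonBuffer at init time (the names and block sizes are mine):

```c
#include <stddef.h>

#define BLOCK_SIZE  256
#define BLOCK_COUNT 16

/* User-mode model: 'slab' stands in for the chunk of common buffer
 * you'd allocate once at init with AllocateCommonBuffer. */
typedef struct {
    unsigned char slab[BLOCK_COUNT][BLOCK_SIZE];
    void         *free_list[BLOCK_COUNT];  /* stack of free blocks */
    int           free_top;
} SUBALLOCATOR;

static void sub_init(SUBALLOCATOR *a)
{
    a->free_top = 0;
    for (int i = 0; i < BLOCK_COUNT; i++)
        a->free_list[a->free_top++] = a->slab[i];
}

/* Returns NULL when the slab is exhausted - the caller queues the
 * request until sub_free makes a block available again. */
static void *sub_alloc(SUBALLOCATOR *a)
{
    return (a->free_top > 0) ? a->free_list[--a->free_top] : NULL;
}

static void sub_free(SUBALLOCATOR *a, void *block)
{
    a->free_list[a->free_top++] = block;
}
```

In a real driver the free list would be a lookaside list or a locked stack, and each block would carry its logical address alongside its virtual one.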

Because of the last limitation you may find yourself required to scale down what your device can do so you're not allocating 1GB of common buffer.  If you're splitting it up into command packets then the amount of common buffer you allocate will limit the number of requests you can send to the device at one time.

Using Common Buffer to hold command packets

The fact that changes made by one side are visible to the other allow you to store commands in the shared section.  Let's take as an example something that was common in storage adapters (several years ago when I worked with them).  Your driver writes a small command packet for the device, which contains:

  • The parameters for the command (a SCSI command descriptor block)
  • A "driver context" parameter which the device uses to report completion (you might use the virtual address of the packet)
  • Space to save the result of the operation (number of bytes transferred, status, etc..)
  • Some scratch space for the device to use (to hold the state of the operation, pointer to the next operation, etc...)
  • The list of logical addresses & lengths which make up the data buffer (the "scatter-gather list"). 

While you're setting up the packet the device will ignore it.  When you're ready to start the operation you write the packet's address to a register on the device.  This triggers the device to start processing the command.  When the device is done it interrupts.  Your driver reads the "driver context" value of the request that completed from a register, reclaims the packet and completes the original request.
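For illustration, a hypothetical packet layout along those lines - every field name and size here is invented, not any real controller's:

```c
#include <stdint.h>
#include <string.h>

#define CDB_LENGTH     16
#define MAX_SG_ENTRIES 16

/* Hypothetical command packet kept in common buffer.  The layout
 * is illustrative only; a real device dictates its own. */
typedef struct COMMAND_PACKET {
    uint8_t  Cdb[CDB_LENGTH];   /* SCSI command descriptor block     */
    uint64_t DriverContext;     /* echoed back by the device at
                                 * completion - e.g. this packet's VA */
    uint32_t Status;            /* result, filled in by the device   */
    uint32_t BytesTransferred;
    uint64_t DeviceScratch[2];  /* scratch space for the device      */
    uint32_t SgCount;           /* the scatter-gather list           */
    struct { uint64_t Address; uint32_t Length; } Sg[MAX_SG_ENTRIES];
} COMMAND_PACKET;

/* Fill in a read-style packet; the SG addresses are made up. */
static void build_packet(COMMAND_PACKET *pkt)
{
    memset(pkt, 0, sizeof(*pkt));
    pkt->Cdb[0] = 0x28;                        /* READ(10), say     */
    pkt->DriverContext = (uint64_t)(uintptr_t)pkt;
    pkt->Sg[0].Address = 0x10000; pkt->Sg[0].Length = 4096;
    pkt->Sg[1].Address = 0x2f000; pkt->Sg[1].Length = 512;
    pkt->SgCount = 2;
}
```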

In this example there are two things that I'd like to point out about the usefulness of common buffer.  First - having the shared memory section setup makes it very simple and efficient to get these command packets and share them with the device.  Doing this with non-paged pool would be much more complex since you'd need to call the DMA DDIs first to get a logical address for the packet (which could copy it into a bounce buffer) and call it again when you were done to undo the translation.

Second - since you didn't have to call the DMA DDI to translate your buffer and get a logical address, you didn't have to worry about whether you could find a free map register.  Remember that all of your attempts to translate buffers compete for the same pool of map registers.  If you were to translate the data buffer and then the command buffer you could end up in a deadlock situation - the data buffer translation could suck up the available registers but you can't release them until the command buffer translation completes.  There are some tricks you could use to fix this, but it's better to keep the command data in the common buffer.

Using Common Buffer to setup a continuous transfer

Let me first admit I've never done this before myself.  It's more of an audio thing than a storage thing, but I think I can still explain the idea.

Say you have a device which processes data in a continuous stream.  The best example of this might be a sound card which runs in a DMA loop, sucking up audio data and pushing it out to the speakers.  Such a device might not interrupt when it's done with a particular "transfer" but instead interrupt every time it's done processing a particular amount of data.  Rather than issuing individual commands with data buffers you would instead compose the data from various requests into a single stream of data for the device.

For such a device the traditional system of translating data buffers and programming scatter gather lists doesn't work very well.  Once a buffer has been translated it can't be modified anymore (since some of the pages may have been copied into bounce buffers ... you can modify the buffer but the bounced pages won't be updated), so you can't do the composition.

Here is another place where common buffer can help you.  You can set up your device to transfer from common buffer in a continuous loop and then copy the data to be processed into this buffer at the appropriate offset.  Since you don't have to do any extra work to make the common buffer useful you could write your data into the buffer and the device will pick it up as it sweeps through.  Assuming the device sweeps through the buffer at a predictable rate you should be able to figure out where to write the next bits of data as long as there's some mechanism to synchronize your clock with the device's clock once in a while.
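The wraparound bookkeeping is the only subtle part.  A user-mode sketch, with a plain array standing in for the looping common buffer (the ring size is made up; a real audio buffer would be much larger):

```c
#define RING_SIZE 16u

/* Model of writing into a looping common buffer: copy len bytes
 * starting at offset, wrapping at the end of the buffer just as
 * the device's sweep does, and return the next write offset. */
static unsigned ring_write(unsigned char *ring, unsigned offset,
                           const unsigned char *src, unsigned len)
{
    for (unsigned i = 0; i < len; i++)
        ring[(offset + i) % RING_SIZE] = src[i];
    return (offset + len) % RING_SIZE;
}
```

Your driver would compute `offset` from where it believes the device's sweep currently is, staying safely ahead of it.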

Using Common Buffer to coalesce buffers

The DMA DDI gives you two options for doing bus-mastered transfers.  If you say you support scatter-gather I/O in your device description (when you get the DMA_ADAPTER) then the DMA DDI will leave a request physically fragmented.  If you don't the DMA DDI will coalesce the entire thing into a single physically contiguous buffer for you, but it also serializes requests so that it doesn't need more than one buffer to do this.

If you want something in the middle then you're going to have to handle it on your own.  Say your device can only handle 5 fragments for a given DMA operation but you get a request with 6 fragments.  Or say you require the fragments to be page aligned, but you're trying to support chained MDLs (which, in summary, means that you may get buffer fragments that aren't page aligned).  None of these cases can be handled by the Windows DMA engine.

To handle this your driver can, once again, turn to common buffer.  If you get a request that you can't handle normally, you can attempt to sub-allocate a single block out of your common buffer and then copy data from the original buffer into the block you just allocated.  Now you can program the device with a single physical address and overcome the limitation.
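A user-mode sketch of that copy step - the FRAGMENT type is mine, and in a driver the destination block would come from your common buffer sub-allocator:

```c
#include <string.h>

typedef struct { const unsigned char *ptr; unsigned len; } FRAGMENT;

/* If a request has more fragments than the device can take, copy
 * them into one contiguous bounce block carved out of common
 * buffer, so a single address + length can be programmed instead.
 * Returns the total bytes copied, or 0 if the request won't fit. */
static unsigned coalesce(unsigned char *block, unsigned capacity,
                         const FRAGMENT *frags, unsigned count)
{
    unsigned total = 0;
    for (unsigned i = 0; i < count; i++) {
        if (total + frags[i].len > capacity)
            return 0;                 /* too big: fail or split it */
        memcpy(block + total, frags[i].ptr, frags[i].len);
        total += frags[i].len;
    }
    return total;
}
```

For a read (device-to-memory) you'd do the same thing in reverse: let the device fill the block, then scatter the data back out to the original fragments.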

This appears to be a pretty common practice in the networking space, where chained MDLs can result in transfers that consist of several tiny fragments with the various headers attached to the network packet.

Previously in this sequence I talked some about what DMA is, and some of the common models for programming DMA on a device.

Like most code, your driver usually deals with virtual addresses for data buffers.  Your DMA engine (be it slave or bus-mastering) is on the other side of the MMU and so can't use virtual addresses.  You might think you should grab the physical address of your buffer and program that onto the device instead, but that's also going to cause problems.  The simplest example is a 32-bit PCI card on a 64-bit system - this controller cannot handle a physical address above 4GB but nothing stops an app from giving you buffers in this range.  Clearly you need to be ready to do some translation [1].

WDM provides a mechanism for doing this translation - the DMA_ADAPTER object.  To get one of these for your device, you would call IoGetDmaAdapter.  This takes a description of the DMA capabilities of your device & information about your maximum transfer size, and returns to you a pointer to the DMA_ADAPTER, which in turn contains pointers to the other DMA functions you can call.  The maximum transfer size is expressed in terms of the number of "Map Registers" that you want to allocate.

Map Registers

Map registers are an abstraction the DMA API uses to track the system resources needed to make one page of memory accessible to your device for a DMA transfer.  They may represent a bounce buffer - a single page of memory which the device can access and that the DMA engine will use to double-buffer part of your transfer.  They could (in the world of the future) represent entries in a page map that maps pages in the physical address space into the device's logical address space (another DDK term).  Or in the case of a 32-bit adapter on a 32-bit system where there's no need for translation, they might represent absolutely nothing at all.  However since you probably want to write a driver that makes your device work on any Windows system, you should ignore this last case and focus on the ones where translation is needed.

You'll want to allocate enough map registers to handle your maximum transfer size.  This limit might be exposed by your hardware, or as a tunable parameter in the registry, or just by common sense (you probably don't need to transfer 1GB in a single shot now do you?).  However since map registers can be a limited resource, you may not always get the number you asked for (it's an in/out parameter to IoGetDmaAdapter).  In that case you'll need to cut down your maximum transfer size - either rejecting larger transfers or breaking them up into smaller pieces and staging them.

So let's say your device can handle a transfer up to 64KB.  You ask for 16 map registers, right?  Not necessarily - it depends on what alignment you need for the DMA.  If you can handle buffers with byte alignment then 16 won't quite cut it - a 64KB transfer that's not page aligned will span 17 pages instead of 16.  Ask for 17 and you'll be sure you can map the entire transfer.
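The worst-case math is easy to check by hand.  This user-mode function reproduces what the ADDRESS_AND_SIZE_TO_SPAN_PAGES macro computes (assuming 4KB pages):

```c
#include <stdint.h>

#define PAGE_SIZE 4096u

/* The same math as ADDRESS_AND_SIZE_TO_SPAN_PAGES: how many
 * physical pages does a buffer at this address with this length
 * touch?  The byte offset within the first page is what pushes a
 * 64KB byte-aligned transfer from 16 pages up to 17. */
static uint32_t span_pages(uint64_t va, uint32_t length)
{
    uint32_t offset = (uint32_t)(va & (PAGE_SIZE - 1));
    return (offset + length + PAGE_SIZE - 1) / PAGE_SIZE;
}
```

A two-byte buffer at 0x8000ffff spans two pages, and a 64KB buffer starting one byte before a page boundary spans the full 17.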

The DMA API keeps track of how many of your map registers you're using at any given time.  The functions you call to allocate map registers for a DMA translation (AllocateAdapterChannel, GetScatterGatherList & BuildScatterGatherList) keep track of how many are in use and call you back when there are sufficient resources available for the operation.  In the ideal case (where no translation is needed), you'll be called back immediately.  In the degenerate case where everything requires translation you may only be processing one request at a time.  However the nice part is that your driver can behave the same regardless of which situation you're in.


[1] There are other conditions that can cause this.  If you have a controller which uses DAC to get at all 64 bits of memory hooked onto a bridge that doesn't properly support DAC (these do exist), your card might be in 32-bit mode anyway.  Some day we may be able to block devices from transferring to or from main memory unless the OS has granted them access, and that will probably require some translation as well.  There are a few exceptions to this, but for the most part just accept that you'll have to do some translation.

Yesterday i talked a little about "what DMA is".  Today i want to talk a little bit about how devices use DMA.

DMA to a Driver

From the driver's point of view there are two aspects to DMA. The first is how you prepare your data for DMA transfers. The second is how you program the device to initiate the transfers & how you notice that a transfer is done. Let's talk about the second part first.

There are an infinite number of models for programming your device to start a DMA. Each introduces its own limitations. I'll go over a few of the common ones i've seen:

  1. The device takes a single physical address base and a length for an operation. This is very simple to program, but requires the transfer to be physically contiguous, which is unlikely for anything other than the smallest transfers (physical memory is often very fragmented, so the chance of two adjoining virtual pages using adjoining physical pages is pretty small). The device will usually interrupt when the DMA transfer is complete.
  2. The device takes a single physical address base & a length for each fragment of an operation. It interrupts when it's done transferring each fragment, allowing your driver to program in the next one. This is going to be slow because of the latency between each fragment, but is still easy to implement.
  3. The device takes a sequence of (physical-address, length) pairs which describe all the fragments of the transfer. This sequence is called a "scatter-gather list" (SG List). The device can then transfer each fragment on its own without the need to interrupt the CPU until all sections are done. In the simplest version of this, the driver programs the SG list to the controller through its registers/ports - writing each element into the device's internal memory. The device will only have a limited space for the SG list, so you may only be able to handle 16 fragments in a given transfer.
  4. In the more complex version of 3, the SG list itself is stored in DMA accessible system memory and the device is programmed with the physical address and length of the scatter-gather list itself. The device can then use DMA to transfer the SG list entries into its own internal buffers. This can reduce the limitations on the length of the SG list, but requires more complex logic in the DMA controller to handle it. However this would require the memory holding the SG list to be physically contiguous.

All of these models have the same basic characteristics.  You tell the controller one or more physical address ranges from/to which to transfer data & you tell it to start transferring data.  Some time in the future the transfer finishes and your driver finds out about it somehow.  Hopefully this "somehow" is through an interrupt but it might also involve polling.  The problem with polling is that you are, once again, wasting a very expensive CPU doing something mundane - in this case spinning and waiting on a bit in a register.

Next time i'll talk some about how you get those physical address ranges in the first place.

-p

So if this looks different each time you check the web page, please bear with me.  I'm having trouble finding a skin that i like.

What Is DMA?

DMA is a way for you to offload the work of transferring data between main memory and the device onto your device. This is in contrast to programmed I/O (PIO) where you have the processor copying data between main memory and the device.

PIO can deliver decent data rates (processors are fairly good at moving data from A to B), but you're effectively running memcpy() for every transfer, which ties up the CPU.  For larger transfers it's better to offload this to some other unit which can move the data from A to B and then signal (preferably with an interrupt) when the transfer is done.

Someone paid a lot of money for a CPU that can do complicated things - math, comparisons, branches, etc...  It's better to leave it free to do this complex stuff and offload the mundane work of moving data between A and B to a cheaper component dedicated to that task.

Flavors of DMA

There are two flavors of DMA - slave-mode and bus-mastering.  In bus-mastering DMA your device initiates the bus cycles that read from or write to main memory, just like a processor might.  In slave-mode your device depends on some system component to do the transfers.

Slave-mode transfers make a device cheaper because it doesn't have to include a DMA controller of its own.  However they're very limiting - you have to share this separate controller across all devices, and your device can't do much to control the rate at which data is transferred.  Finally (i believe) the PC DMA controller was limited to a 24-bit address space and required contiguous buffers, so it really doesn't scale well for modern PCs.

For bus-mastering DMA you place a "DMA Controller" on your device to run the DMA cycles for you.  The device will steal some bus time and initiate a memory transfer as if it were another CPU.  Data is transferred directly between main memory and the device's memory ranges.  You can have multiple bus-masters running independently of each other - they share the memory bus using some common protocol.  This is more efficient than having all your devices fight over a single transfer agent (whether it's the CPU (PIO) or a separate DMA controller (slave-mode)).

Part 2 will talk about the various models i've seen for using DMA when programming a controller.

I've been trying to find time to start a blog for a while now.  I figure the first question i'll be asked is "who are you and why should I care?".  Like all things the answer to that isn't particularly easy.

If you're not interested in driver development, then you probably don't care & should spend time reading a more interesting blog.  If you're interested in device drivers for Windows, then I'm hoping I may be of some assistance.  I've made many, many mistakes while writing Windows drivers over the last 10+ years and I've tried to learn from all of them.

I joined MS straight out of college in 1995.  I started working in the Windows NT device drivers team - which consisted of 5 other developers at the time.  If it wasn't video or networking, our team most likely handled the drivers for it.

I started out working on SCSI miniport drivers - more specifically helping to maintain the third-party SCSI miniports that we shipped in-box.  This was largely a diagnostic position, debugging problems when they came up in house, reviewing & integrating changes into the Windows source code base.  It was a good learning experience to work with two separate code bases - the Windows base which was reasonably clean and comprehensible, and the device driver code which could be ... opaque ... at times.

I worked on some other driver stacks too.  I owned the Parallel port driver for a while, as well as the drivers for the 8042 controller (a piece of hardware which still gives me a stomachache to this day.) 

For Windows 2000 I did the bulk of the work converting the storage drivers over to Plug-and-Play.  I can be blamed for such things as classpnp.sys, and the AdapterControl routines in SCSI miniports.  I worked as the Development Lead for the mass-storage drivers team, who worked on all of the software between volume management (ftdisk and/or LVM) and the underlying controllers.

A few years ago I helped to start up the Windows Driver Foundation team.  The goal of this team was to investigate technological solutions to improving the quality of device drivers using a variety of approaches.  This includes the static verification tools like SDV and PreFast for Drivers (PFD), as well as the Windows Driver Framework (WDF) implementations in User-Mode and Kernel-Mode.

Currently I am the Development Lead for the User-Mode Driver Framework.  This is an implementation of a subset of the WDF that allows developers to write drivers for some device types which will run in user-mode.  If you think about your typical Windows system, there are a number of device drivers running on it which don't really need to be in the Kernel, and we're trying to provide an alternate system which builds on the design patterns already set down in WDF.

So that said, my desire here is to talk about Windows device drivers - the things that confused me as a starting developer, the things that I get questions about frequently, the things that I love about it and the things which drive me absolutely nuts.

-p