The standard quasi newton update in fmin uses the gradients to update an approximation to the inverse hessian. This is I recall an o(n^2) operation, but only needs to trasnsfer o(n) double to the GPU. For models with say 5,000 parameters it might be a good candidate for using the GPU. It can be done in C, no C++ is needed.