"""
GPU driver for C kernels

There should be a single GPU environment running on the system. This
environment is constructed on the first call to :func:`environment`, and
the same environment is returned on each call.

After retrieving the environment, the next step is to create the kernel.
This is done by wrapping the model source in a :class:`GpuModel` and
calling :meth:`GpuModel.make_kernel`, which returns a kernel bound to a
particular set of q values.

Next a :class:`GpuInput` object should be created with the correct kind
of data. This data object can be used by multiple kernels, for example,
if the target model is a weighted sum of multiple kernels. The data
should include any extra evaluation points required to compute the proper
data smearing. This need not match the square grid for 2D data if there
is an index saying which q points are active.

Together the GpuInput, the program, and a device form a :class:`GpuKernel`.
This kernel is used during fitting, receiving new sets of parameters and
evaluating them. The output value is stored in an output buffer on the
devices, where it can be combined with other structure factors and form
factors and have instrumental resolution effects applied.
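
A minimal sketch of these steps (*source* and *model_info* here are
assumed to come from :func:`generate.make_source` and
:func:`generate.make_model_info` for some model definition)::

    model = GpuModel(source, model_info, dtype='single')
    kernel = model.make_kernel([q])     # use [qx, qy] for 2-D data
    Iq = kernel(details, weights, values, cutoff=0.)
    kernel.release()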

In order to use OpenCL for your models, you will need OpenCL drivers for
your machine. These should be available from your graphics card vendor.
Intel provides OpenCL drivers for CPUs as well as their integrated HD
graphics chipsets. AMD also provides drivers for Intel CPUs, but as of
this writing the performance is lacking compared to the Intel drivers.
NVIDIA combines drivers for CUDA and OpenCL in one package. The result
is a bit messy if you have multiple drivers installed. You can see which
drivers are available by starting python and running::

    import pyopencl as cl
    cl.create_some_context(interactive=True)

Once you have done that, it will show the available drivers which you
can select. It will then tell you that you can use these drivers
automatically by setting the PYOPENCL_CTX environment variable.
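
For example, if the interactive prompt above reports that choice '0' is
the driver you want, you can set the variable from within python, before
the context is created::

    import os
    os.environ['PYOPENCL_CTX'] = '0'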

Some graphics cards have multiple devices on the same card. You cannot
yet use both of them concurrently to evaluate models, but you can run
the program twice using a different device for each session.

OpenCL kernels are compiled when needed by the device driver. Some
drivers produce compiler output even when there is no error. You
can see the output by setting PYOPENCL_COMPILER_OUTPUT=1. It should be
harmless, albeit annoying.
"""
from __future__ import print_function

import os
import warnings

import numpy as np

try:
    raise NotImplementedError("OpenCL not yet implemented for new kernel template")
    import pyopencl as cl
    # Ask OpenCL for the default context so that we know that one exists
    cl.create_some_context(interactive=False)
except Exception as exc:
    warnings.warn(str(exc))
    raise RuntimeError("OpenCL not available")

from pyopencl import mem_flags as mf
from pyopencl.characterize import get_fast_inaccurate_build_options

from . import generate

# The max loops number is limited by the amount of local memory available
# on the device. You don't want to make this value too big because it will
# waste resources, nor too small because it may interfere with users trying
# to do their polydispersity calculations. A value of 2048 should be much
# larger than necessary given that cost grows as npts^k where k is the number
# of polydisperse parameters.
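# For example, three polydisperse parameters with 40 points each use only
# 3*40 = 120 loop entries, but already cost 40**3 = 64000 evaluations per
# q point, so compute time explodes long before this limit is reached.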
MAX_LOOPS = 2048


# Pragmas for enabling OpenCL features. Be sure to protect them so that they
# still compile even if OpenCL is not present.
_F16_PRAGMA = """\
#if defined(__OPENCL_VERSION__) // && !defined(cl_khr_fp16)
# pragma OPENCL EXTENSION cl_khr_fp16: enable
#endif
"""

_F64_PRAGMA = """\
#if defined(__OPENCL_VERSION__) // && !defined(cl_khr_fp64)
# pragma OPENCL EXTENSION cl_khr_fp64: enable
#endif
"""


ENV = None
def environment():
    """
    Returns a singleton :class:`GpuEnvironment`.

    This provides an OpenCL context and one queue per device.
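
    For example, to fetch the queue on which float32 kernels will run
    (a minimal sketch)::

        queue = environment().get_queue(generate.F32)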
    """
    global ENV
    if ENV is None:
        ENV = GpuEnvironment()
    return ENV

def has_type(device, dtype):
    """
    Return true if device supports the requested precision.
    """
    if dtype == generate.F32:
        return True
    elif dtype == generate.F64:
        return "cl_khr_fp64" in device.extensions
    elif dtype == generate.F16:
        return "cl_khr_fp16" in device.extensions
    else:
        return False

def get_warp(kernel, queue):
    """
    Return the size of an execution batch for *kernel* running on *queue*.
    """
    return kernel.get_work_group_info(
        cl.kernel_work_group_info.PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
        queue.device)

def _stretch_input(vector, dtype, extra=1e-3, boundary=32):
    """
    Stretch an input vector to the correct boundary.

    Performance on the kernels can drop by a factor of two or more if the
    number of values to compute does not fall on a nice power of two
    boundary. The trailing additional vector elements are given a
    value of *extra*, and so f(*extra*) will be computed for each of
    them. The returned array will thus be a subset of the computed array.

    *boundary* should be a power of 2 which is at least 32 for good
    performance on current platforms (as of Jan 2015). It should
    probably be the max of get_warp(kernel, queue) and
    device.min_data_type_align_size//4.
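
    For example, a 1000 element vector does not fall on a 32 element
    boundary, so it is stretched to the next one (a sketch of the
    behaviour)::

        q = np.arange(1000, dtype='d')
        padded = _stretch_input(q, 'd', boundary=32)
        # padded.size == 1024; padded[1000:] is filled with extra=1e-3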
    """
    remainder = vector.size % boundary
    if remainder != 0:
        size = vector.size + (boundary - remainder)
        vector = np.hstack((vector, [extra] * (size - vector.size)))
    return np.ascontiguousarray(vector, dtype=dtype)


def compile_model(context, source, dtype, fast=False):
    """
    Build a model to run on the gpu.

    Returns the compiled program. A RuntimeError is raised if the
    requested *dtype* is not supported by all devices in the context
    (for example, float64 on a device without the cl_khr_fp64 extension).
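
    A sketch of a direct call (normally :meth:`GpuEnvironment.compile_program`
    calls this for you)::

        context = environment().get_context(generate.F32)
        program = compile_model(context, source, generate.F32, fast=True)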
    """
    dtype = np.dtype(dtype)
    if not all(has_type(d, dtype) for d in context.devices):
        raise RuntimeError("%s not supported for devices" % dtype)

    source_list = [generate.convert_type(source, dtype)]

    if dtype == generate.F16:
        source_list.insert(0, _F16_PRAGMA)
    elif dtype == generate.F64:
        source_list.insert(0, _F64_PRAGMA)

    # Note: USE_SINCOS makes the intel cpu slower under opencl
    if context.devices[0].type == cl.device_type.GPU:
        source_list.insert(0, "#define USE_SINCOS\n")
    options = (get_fast_inaccurate_build_options(context.devices[0])
               if fast else [])
    source = "\n".join(source_list)
    program = cl.Program(context, source).build(options=options)
    return program


# for now, this returns one device in the context
# TODO: create a context that contains all devices on all platforms
class GpuEnvironment(object):
    """
    GPU context, with possibly many devices, and one queue per device.
    """
    def __init__(self):
        # find gpu context
        #self.context = cl.create_some_context()

        self.context = None
        if 'PYOPENCL_CTX' in os.environ:
            self._create_some_context()

        if not self.context:
            self.context = _get_default_context()

        # Byte boundary for data alignment
        #self.data_boundary = max(d.min_data_type_align_size
        #                         for d in self.context.devices)
        self.queues = [cl.CommandQueue(context, context.devices[0])
                       for context in self.context]
        self.compiled = {}

    def has_type(self, dtype):
        """
        Return True if the given type is supported by some device in the
        environment.
        """
        dtype = generate.F32 if dtype == 'fast' else np.dtype(dtype)
        return any(has_type(d, dtype)
                   for context in self.context
                   for d in context.devices)

    def get_queue(self, dtype):
        """
        Return a command queue for the kernels of type dtype.
        """
        for context, queue in zip(self.context, self.queues):
            if all(has_type(d, dtype) for d in context.devices):
                return queue

    def get_context(self, dtype):
        """
        Return an OpenCL context for the kernels of type dtype.
        """
        for context in self.context:
            if all(has_type(d, dtype) for d in context.devices):
                return context

    def _create_some_context(self):
        """
        Protected call to cl.create_some_context without interactivity. Use
        this if PYOPENCL_CTX is set in the environment. Sets the *context*
        attribute.
        """
        try:
            self.context = [cl.create_some_context(interactive=False)]
        except Exception as exc:
            warnings.warn(str(exc))
            warnings.warn("pyopencl.create_some_context() failed")
            warnings.warn("the environment variable 'PYOPENCL_CTX' might not be set correctly")

    def compile_program(self, name, source, dtype, fast=False):
        """
        Compile the program for the device in the given context.
        """
        key = "%s-%s-%s" % (name, dtype, fast)
        if key not in self.compiled:
            print("compiling", name)
            dtype = np.dtype(dtype)
            program = compile_model(self.get_context(dtype),
                                    str(source), dtype, fast)
            self.compiled[key] = program
        return self.compiled[key]

    def release_program(self, name):
        """
        Free memory associated with the program on the device.
        """
        # Programs are cached under a "name-dtype-fast" key, so release
        # every compiled variant of the named program.
        for key in [key for key in self.compiled
                    if key.startswith(name + "-")]:
            self.compiled[key].release()
            del self.compiled[key]

def _get_default_context():
    """
    Get an OpenCL context, preferring GPU over CPU. Among GPUs, dedicated
    AMD/NVIDIA drivers are preferred over integrated graphics; among CPUs,
    Intel/Apple drivers are preferred over AMD's.
    """
    # Note: on mobile devices there is automatic clock scaling if either the
    # CPU or the GPU is underutilized; probably doesn't affect us, but if
    # it did, it would mean that putting a busy loop on the CPU while the GPU
    # is running may increase throughput.
    #
    # Macbook pro, base install:
    #     {'Apple': [Intel CPU, NVIDIA GPU]}
    # Macbook pro, base install:
    #     {'Apple': [Intel CPU, Intel GPU]}
    # 2 x nvidia 295 with Intel and NVIDIA opencl drivers installed
    #     {'Intel': [CPU], 'NVIDIA': [GPU, GPU, GPU, GPU]}
    gpu, cpu = None, None
    for platform in cl.get_platforms():
        # AMD provides a much weaker CPU driver than Intel/Apple, so avoid
        # it. If someone has bothered to install the AMD/NVIDIA drivers,
        # prefer them over the integrated graphics driver that may have
        # been supplied with the CPU chipset.
        preferred_cpu = (platform.vendor.startswith('Intel')
                         or platform.vendor.startswith('Apple'))
        preferred_gpu = (platform.vendor.startswith('Advanced')
                         or platform.vendor.startswith('NVIDIA'))
        for device in platform.get_devices():
            if device.type == cl.device_type.GPU:
                # If the existing type is not GPU then it will be CUSTOM or
                # ACCELERATOR, so don't override it.
                if gpu is None or (preferred_gpu
                                   and gpu.type == cl.device_type.GPU):
                    gpu = device
            elif device.type == cl.device_type.CPU:
                if cpu is None or preferred_cpu:
                    cpu = device
            else:
                # System has cl.device_type.ACCELERATOR or
                # cl.device_type.CUSTOM; an Intel Phi, for example,
                # registers as an accelerator. Since the user installed a
                # custom device on their system and went through the pain
                # of sorting out OpenCL drivers for it, let's assume they
                # really do want to use it as their primary compute device.
                gpu = device

    # Order the devices by gpu then by cpu; when searching for an available
    # device by data type they will be checked in this order, which means
    # that if the gpu supports double then the cpu will never be used (though
    # we may make it possible to explicitly request the cpu at some point).
    devices = []
    if gpu is not None:
        devices.append(gpu)
    if cpu is not None:
        devices.append(cpu)
    return [cl.Context([d]) for d in devices]


class GpuModel(object):
    """
    GPU wrapper for a single model.

    *source* and *model_info* are the model source and interface as returned
    from :func:`generate.make_source` and :func:`generate.make_model_info`.

    *dtype* is the desired model precision. Any numpy dtype for single
    or double precision floats will do, such as 'f', 'float32' or 'single'
    for single and 'd', 'float64' or 'double' for double. Double precision
    is an optional extension which may not be available on all devices.
    Half precision ('float16', 'half') may be available on some devices.
    Fast precision ('fast') is a loose version of single precision, indicating
    that the compiler is allowed to take shortcuts.
    """
    def __init__(self, source, model_info, dtype=generate.F32):
        self.info = model_info
        self.source = source
        self.dtype = generate.F32 if dtype == 'fast' else np.dtype(dtype)
        self.fast = (dtype == 'fast')
        self.program = None  # delay program creation

    def __getstate__(self):
        return self.info, self.source, self.dtype, self.fast

    def __setstate__(self, state):
        self.info, self.source, self.dtype, self.fast = state
        self.program = None

    def make_kernel(self, q_vectors):
        if self.program is None:
            compiler = environment().compile_program
            self.program = compiler(self.info['name'], self.source,
                                    self.dtype, self.fast)
        is_2d = len(q_vectors) == 2
        kernel_name = generate.kernel_name(self.info, is_2d)
        kernel = getattr(self.program, kernel_name)
        return GpuKernel(kernel, self.info, q_vectors, self.dtype)

    def release(self):
        """
        Free the resources associated with the model.
        """
        if self.program is not None:
            environment().release_program(self.info['name'])
            self.program = None

    def __del__(self):
        self.release()

# TODO: check that we don't need a destructor for buffers which go out of scope
class GpuInput(object):
    """
    Make q data available to the gpu.

    *q_vectors* is a list of q vectors, which will be *[q]* for 1-D data,
    and *[qx, qy]* for 2-D data. Internally, the vectors will be reallocated
    to get the best performance on OpenCL, which may involve shifting and
    stretching the array to better match the memory architecture. Additional
    points will be evaluated with *q=1e-3*.

    *dtype* is the data type for the q vectors. The data type should be
    set to match that of the kernel, which is an attribute of
    :class:`GpuModel`. Note that not all kernels support double
    precision, so even if the program was created for double precision,
    the *GpuModel.dtype* may be single precision.

    Call :meth:`release` when complete. Even if not called directly, the
    buffer will be released when the data object is freed.
    """
    def __init__(self, q_vectors, dtype=generate.F32):
        # TODO: do we ever need double precision q?
        env = environment()
        self.nq = q_vectors[0].size
        self.dtype = np.dtype(dtype)
        self.is_2d = (len(q_vectors) == 2)
        # TODO: stretch input based on get_warp()
        # not doing it now since warp depends on kernel, which is not known
        # at this point, so instead using 32, which is good on the set of
        # architectures tested so far.
        if self.is_2d:
            # Note: 17 rather than 15 because the result is 2 elements
            # longer than the input.
            width = ((self.nq+17)//16)*16
            self.q = np.empty((width, 2), dtype=dtype)
            self.q[:self.nq, 0] = q_vectors[0]
            self.q[:self.nq, 1] = q_vectors[1]
        else:
            # Note: 33 rather than 31 because the result is 2 elements
            # longer than the input.
            width = ((self.nq+33)//32)*32
            self.q = np.empty(width, dtype=dtype)
            self.q[:self.nq] = q_vectors[0]
        self.global_size = [self.q.shape[0]]
        context = env.get_context(self.dtype)
        #print("creating inputs of size", self.global_size)
        self.q_b = cl.Buffer(context, mf.READ_ONLY | mf.COPY_HOST_PTR,
                             hostbuf=self.q)

    def release(self):
        """
        Free the memory.
        """
        # Release the OpenCL buffer; the numpy array in self.q is left to
        # the garbage collector.
        if self.q_b is not None:
            self.q_b.release()
            self.q_b = None

    def __del__(self):
        self.release()

class GpuKernel(object):
    """
    Callable SAS kernel.

    *kernel* is the compiled OpenCL kernel to call

    *model_info* is the module information

    *q_vectors* is the q vectors at which the kernel should be evaluated

    *dtype* is the kernel precision

    The resulting call method takes *details*, the integer vector
    describing the polydispersity loops, *weights* and *values*, the
    weights and values for the model parameters, and *cutoff*, which
    determines the integration limits: any points with combined weight
    less than *cutoff* will not be calculated.

    Call :meth:`release` when done with the kernel instance.
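
    A minimal sketch of one evaluation (*details*, *weights* and *values*
    are assumed to have been packed to match the model's parameter
    table)::

        kernel = model.make_kernel([qx, qy])
        Iqxy = kernel(details, weights, values, cutoff=0.)
        kernel.release()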
    """
    def __init__(self, kernel, model_info, q_vectors, dtype):
        max_pd = model_info['max_pd']
        npars = len(model_info['parameters'])-2
        q_input = GpuInput(q_vectors, dtype)
        self.dtype = dtype
        self.dim = '2d' if q_input.is_2d else '1d'
        self.kernel = kernel
        self.info = model_info
        self.pd_stop_index = 4*max_pd-1
        # plus three for the normalization values
        self.result = np.empty(q_input.nq+3, q_input.dtype)

        # Inputs and outputs for each kernel call
        # Note: res may be shorter than res_b if global_size != nq
        env = environment()
        self.queue = env.get_queue(dtype)

        # details is int32 data, padded to an 8 integer boundary
        size = ((max_pd*5 + npars*3 + 2 + 7)//8)*8
        self.result_b = cl.Buffer(self.queue.context, mf.READ_WRITE,
                                  q_input.global_size[0] * q_input.dtype.itemsize)
        self.q_input = q_input  # allocated by GpuInput above

        self._need_release = [self.result_b, self.q_input]

    def __call__(self, details, weights, values, cutoff):
        real = (np.float32 if self.q_input.dtype == generate.F32
                else np.float64 if self.q_input.dtype == generate.F64
                else np.float16 if self.q_input.dtype == generate.F16
                else np.float32)  # will never get here, so use np.float32
        assert details.dtype == np.int32
        assert weights.dtype == real and values.dtype == real

        context = self.queue.context
        details_b = cl.Buffer(context, mf.READ_ONLY | mf.COPY_HOST_PTR,
                              hostbuf=details)
        weights_b = cl.Buffer(context, mf.READ_ONLY | mf.COPY_HOST_PTR,
                              hostbuf=weights)
        values_b = cl.Buffer(context, mf.READ_ONLY | mf.COPY_HOST_PTR,
                             hostbuf=values)

        start, stop = 0, details[self.pd_stop_index]
        args = [
            np.uint32(self.q_input.nq), np.uint32(start), np.uint32(stop),
            details_b, weights_b, values_b,
            self.q_input.q_b, self.result_b, real(cutoff),
        ]
        self.kernel(self.queue, self.q_input.global_size, None, *args)
        cl.enqueue_copy(self.queue, self.result, self.result_b)
        for v in (details_b, weights_b, values_b):
            v.release()

        return self.result[:self.q_input.nq]

    def release(self):
        """
        Release resources associated with the kernel.
        """
        for v in self._need_release:
            v.release()
        self._need_release = []

    def __del__(self):
        self.release()
---|