Converting a multi-byte Unicode byte array into an NSString, using a partial buffer
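In Objective-C, is there a way to convert a multi-byte Unicode byte array into an NSString that lets the conversion succeed even when the array data is a partial buffer (one that does not end on a complete character boundary)?
The use case is receiving byte buffers from a stream and wanting to parse the string version of the data buffer, when more data is still to come and the buffer may end in the middle of a multi-byte Unicode character.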
NSString's initWithData:encoding: method does not work for this purpose, as shown here…
Test code:
    - (void)test {
        // "foo×bar" in UTF-8; "×" is the two-byte sequence 0xc3 0x97.
        char myArray[] = {'f', 'o', 'o', (char) 0xc3, (char) 0x97, 'b', 'a', 'r'};
        size_t sizeOfMyArray = sizeof(myArray);
        [self dump:myArray sizeOfMyArray:sizeOfMyArray];
        [self dump:myArray sizeOfMyArray:sizeOfMyArray - 1];
        [self dump:myArray sizeOfMyArray:sizeOfMyArray - 2];
        [self dump:myArray sizeOfMyArray:sizeOfMyArray - 3];
        [self dump:myArray sizeOfMyArray:sizeOfMyArray - 4];
        [self dump:myArray sizeOfMyArray:sizeOfMyArray - 5];
    }

    - (void)dump:(char[])myArray sizeOfMyArray:(size_t)sourceLength {
        NSString *string = [[NSString alloc] initWithData:[NSData dataWithBytes:myArray length:sourceLength]
                                                 encoding:NSUTF8StringEncoding];
        // string.length is an NSUInteger, so cast it and print it with %lu rather than %i.
        NSLog(@"sourceLength: %lu bytes, string.length: %lu bytes, string :'%@'",
              (unsigned long)sourceLength, (unsigned long)string.length, string);
    }
Output:

    sourceLength: 8 bytes, string.length: 7 bytes, string :'foo×bar'
    sourceLength: 7 bytes, string.length: 6 bytes, string :'foo×ba'
    sourceLength: 6 bytes, string.length: 5 bytes, string :'foo×b'
    sourceLength: 5 bytes, string.length: 4 bytes, string :'foo×'
    sourceLength: 4 bytes, string.length: 0 bytes, string :'(null)'
    sourceLength: 3 bytes, string.length: 3 bytes, string :'foo'
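As can be seen, converting the "sourceLength: 4 bytes" byte array fails and returns (null). This is because the UTF-8 character "×" (0xc3 0x97) is only partially included.
Ideally there would be a function I could use that returns the correct NSString and tells me how many bytes are "left over".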
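You pretty much have your own answer. If the initWithData:encoding: method returns nil, then you know the buffer has a partial (invalid) character at the end.
Modify dump to return an int. Then try to create the NSString in a loop. Each time you get nil, reduce the length by one and try again. Once you get a valid NSString, return the difference between the length you were passed and the length you actually used.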
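The loop described above can be sketched in plain C, which also compiles inside an Objective-C file. This is only an illustration under some assumptions: `decodes_as_utf8` and `leftover_after_longest_prefix` are hypothetical names, the structural check stands in for "the NSString initializer returned non-nil" (it does not reject overlong forms or surrogates), and the back-off is capped at 3 bytes because a UTF-8 character is at most 4 bytes long, a bound the answer itself does not state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for "the NSString initializer returned non-nil":
   checks that every lead byte is followed by the right number of
   continuation bytes. Overlong forms and surrogates are not rejected. */
static bool decodes_as_utf8(const unsigned char *buf, size_t len) {
    size_t i = 0;
    while (i < len) {
        unsigned char c = buf[i];
        size_t need;
        if (c <= 0x7f) need = 1;
        else if (c >= 0xc2 && c <= 0xdf) need = 2;
        else if (c >= 0xe0 && c <= 0xef) need = 3;
        else if (c >= 0xf0 && c <= 0xf4) need = 4;
        else return false;                 /* stray continuation or invalid lead byte */
        if (need > len - i) return false;  /* sequence cut off by the end of the buffer */
        for (size_t k = 1; k < need; k++) {
            if ((buf[i + k] & 0xc0) != 0x80) return false;
        }
        i += need;
    }
    return true;
}

/* Shrink the length one byte at a time until the prefix decodes.
   Returns the number of leftover bytes (0..3), or -1 if no prefix
   within 3 bytes of the end decodes, i.e. the data is not just truncated. */
static int leftover_after_longest_prefix(const unsigned char *buf, size_t len) {
    assert(buf != NULL || len == 0);
    size_t max_backoff = len < 3 ? len : 3;
    for (size_t cut = 0; cut <= max_backoff; cut++) {
        if (decodes_as_utf8(buf, len - cut)) return (int)cut;
    }
    return -1;
}
```

With the question's array, passing the first 4 bytes reports 1 leftover byte (the lone 0xc3), while lengths 3, 5 and 8 report 0. In the Objective-C version the validity check would simply be the nil test on the created NSString.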
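I ran into this problem a while ago and then forgot about it. This was an opportunity to finally do it. The code below was written using the information on the UTF-8 page on Wikipedia. It is a category on NSData.
It checks the data from the end, and only the last four bytes, because the OP said the data could be gigabytes in size. Otherwise, it would be simpler to walk the bytes from the start, following the UTF-8 rules.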
    /*
     Return the range of valid UTF-8 encoded text by removing a partial trailing multi-byte char.
     It assumes that all the bytes are valid UTF-8 encoded chars, e.g. it doesn't raise a flag
     if a continuation byte is preceded by a single-byte char.
     */
    - (NSRange)rangeOfUTF8WithoutPartialTrailingMultibytes {
        NSRange validRange = {0, 0};
        NSUInteger trailLength = MIN([self length], 4U);
        unsigned char trail[4];
        [self getBytes:trail range:NSMakeRange([self length] - trailLength, trailLength)];
        unsigned multibyteCount = 0;
        // Walk backwards over the trailing bytes until a single-byte char or a start byte is found.
        for (NSInteger i = (NSInteger)trailLength - 1; i >= 0; i--) {
            if (isUTF8SingleByte(trail[i])) {
                validRange = NSMakeRange(0, [self length] - trailLength + i + 1);
                break;
            }
            if (isUTF8ContinuationByte(trail[i])) {
                multibyteCount++;
                continue;
            }
            if (isUTF8StartByte(trail[i])) {
                multibyteCount++;
                if (multibyteCount == lengthForUTF8StartByte(trail[i])) {
                    // The last multi-byte char is complete: keep it.
                    validRange = NSMakeRange(0, [self length] - trailLength + i + multibyteCount);
                } else {
                    // The last multi-byte char is truncated: stop just before its start byte.
                    validRange = NSMakeRange(0, [self length] - trailLength + i);
                }
                break;
            }
        }
        return validRange;
    }
Here are the static functions used in the method:
    static BOOL isUTF8SingleByte(const unsigned char c) {
        return c <= 0x7f;
    }

    static BOOL isUTF8ContinuationByte(const unsigned char c) {
        return (c >= 0x80) && (c <= 0xbf);
    }

    static BOOL isUTF8StartByte(const unsigned char c) {
        return (c >= 0xc2) && (c <= 0xf4);
    }

    static BOOL isUTF8InvalidByte(const unsigned char c) {
        return (c == 0xc0) || (c == 0xc1) || (c > 0xf4);
    }

    static unsigned lengthForUTF8StartByte(const unsigned char c) {
        if ((c >= 0xc2) && (c <= 0xdf)) {
            return 2;
        } else if ((c >= 0xe0) && (c <= 0xef)) {
            return 3;
        } else if ((c >= 0xf0) && (c <= 0xf4)) {
            return 4;
        }
        return 1;
    }
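To see what the tail scan yields for the byte array in the question, here is a rough plain-C rendering of the same idea that works on a raw buffer instead of NSData. The names `utf8_sequence_length` and `utf8_complete_prefix_length` are made up for this sketch, and it simplifies the category method rather than reproducing it exactly: like the original, it assumes the data is well-formed UTF-8 apart from a possibly truncated last character, and it makes no attempt to diagnose malformed input.

```c
#include <assert.h>
#include <stddef.h>

/* Number of bytes announced by a UTF-8 lead byte, or 0 if c is not a lead byte. */
static size_t utf8_sequence_length(unsigned char c) {
    if (c <= 0x7f) return 1;
    if (c >= 0xc2 && c <= 0xdf) return 2;
    if (c >= 0xe0 && c <= 0xef) return 3;
    if (c >= 0xf0 && c <= 0xf4) return 4;
    return 0;
}

/* Length of the longest prefix of buf that ends on a character boundary.
   Only the last 4 bytes are inspected, so the cost does not depend on len. */
static size_t utf8_complete_prefix_length(const unsigned char *buf, size_t len) {
    assert(buf != NULL || len == 0);
    size_t seen = 0;  /* bytes examined so far, counted from the end */
    while (seen < len && seen < 4) {
        unsigned char c = buf[len - 1 - seen];
        seen++;
        if ((c & 0xc0) == 0x80) continue;  /* continuation byte: keep walking back */
        /* First non-continuation byte: keep the last character only if
           all of the bytes it announces are present. */
        size_t need = utf8_sequence_length(c);
        return (need == seen) ? len : len - seen;
    }
    /* Empty buffer, or nothing but continuation bytes in the tail (malformed):
       leave the length alone and let the real decoder report the problem. */
    return len;
}
```

For the question's array the result is 8, 5, 3 and 3 for input lengths 8, 5, 4 and 3, so the 4-byte case leaves exactly one byte (0xc3) to carry over into the next buffer.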
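This is my inefficient implementation, and I don't consider it a proper answer. I'll leave it here in case others find it useful (and hopefully someone else will give a better answer than this!)
It is in a category on NSMutableData…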
    /**
     * Removes the biggest string possible from this NSMutableData, leaving any remainder unicode half-characters behind.
     *
     * NOTE: This is a very inefficient implementation, it may require multiple parsing of the complete NSMutableData buffer,
     * it is especially inefficient when the data buffer does not contain a valid string encoding, as all lengths will be
     * attempted.
     */
    - (NSString *)removeMaximumStringUsingEncoding:(NSStringEncoding)encoding {
        if (self.length > 0) {
            // Quick test for the case where the whole buffer can be used (is common case, and doesn't require NSData manipulation).
            NSString *result = [[NSString alloc] initWithData:self encoding:encoding];
            if (result != nil) {
                self.length = 0; // Simple case, we used the whole buffer.
                return result;
            }

            // Try to find the largest subData that is a valid string.
            for (NSUInteger subDataLength = self.length - 1; subDataLength > 0; subDataLength--) {
                NSRange subDataRange = NSMakeRange(0, subDataLength);
                result = [[NSString alloc] initWithData:[self subdataWithRange:subDataRange] encoding:encoding];
                if (result != nil) {
                    // Delete the bytes we used from our buffer, leave the remainder.
                    [self replaceBytesInRange:subDataRange withBytes:NULL length:0];
                    return result;
                }
            }
        }
        return @"";
    }